

THEORY OF PROBABILITY 




M. E. MUNROE 
Assistant Professor of Mathematics 
University of Illinois 


First Edition


New York Toronto London 
McGRAW-HILL BOOK COMPANY, INC. 
1951 




Copyright, 1951, by the McGraw-Hill Book Company, Inc. Printed in the
United States of America. All rights reserved. This book, or parts thereof,
may not be reproduced in any form without permission of the publishers.




THE MAPLE PRESS COMPANY, YORK, PA. 


PREFACE 


This book is the outgrowth of an undergraduate-graduate course 
that the author has offered for the past few years at the University 
of Illinois. 

The modern theory of probability is based on the Lebesgue-Stieltjes 
integral and so requires a rather extensive background in analysis for 
a rigorous presentation. However, a formal development of this 
theory should be intelligible to students with considerably less prepara- 
tion, and this book is intended to be just such a formal summary. We 
have attempted to keep the analytical level of the book down to the 
point where one year of calculus can serve as a prerequisite for the 
course. On the other hand, we have not hesitated to include impor- 
tant theorems that cannot be proved on this level. We have tried to 
be honest with the student by pointing out the lapses in rigor that 
necessarily occur because of the nature of our undertaking. 

The true flavor of modern probability theory is best seen in the 
limit theorems; therefore Chapters 2 to 6 have been designed primarily 
as an introduction to Chapters 8 to 11. Stochastic variables are 
introduced almost simultaneously with the notion of mathematical 
probability, and the discrete and continuous cases are treated side by 
side throughout. This is an unorthodox arrangement for an elemen- 
tary text, but it seems desirable in view of our aim to push beyond the 
problem course in permutations and combinations that is often 
regarded as an introduction to probability theory. 

A large portion of the material in this book can be found, in slightly 
different form, in Cramér’s Mathematical Methods of Statistics and in 
Uspensky’s Introduction to Mathematical Probability. The author 
takes this opportunity to acknowledge his indebtedness to these sources. 

Professor J. L. Doob has been very kind in undertaking to criticize 
the manuscript in various stages of preparation. His suggestions have 
strengthened the work immeasurably.

M. E. Munroe


URBANA, ILL. 
November, 1950 


CONTENTS

PREFACE

1. PERMUTATIONS AND COMBINATIONS
Formulas for ${}^nP_r$ and ${}^nC_r$. Example—Poker Hands. Binomial
Coefficients. References for Further Study. Problems

2. MATHEMATICAL PROBABILITY
An Elementary Definition. Example—Roulette. The Addition Principle.
Example—Two Dice. The Axiomatic Definition. Stochastic Variables.
The Discrete Case. Example—Balls from an Urn. The Continuous Case.
Example—Bombardment of Hemispherical Screen. Example—The Radium
Atom. Synthesis of Discrete and Continuous Cases. References for
Further Study. Problems

3. ...
... Dice. Conditional Probabilities. The Multiplication Theorem.
Bayes’ Theorem. Construction of Joint Density Functions. Independent
Stochastic Variables. Joint Density Functions—Independent Case.
Example—Three Urns. Example—Points Chosen at Random. Example—The
Buffon Needle Problem. Example—Genetics. References for Further
Study. Problems

4. REPEATED TRIALS AND ALTERNATIVE EVENTS
Bernoulli Formula—Physical Version. Bernoulli Formula—Stochastic
Variable Version. Example—Random Walk Problem. The General Addition
Formula. Example—The Matching Problem. References for Further Study.
Problems

5. MORE ABOUT STOCHASTIC VARIABLES
Functions of Stochastic Variables. Translation and Change of Scale.
Representation of Physical Situations. Sums and Products.
Convolutions. Normal Distributions. References for Further Study.
Problems

6. MOMENTS
Moments of a Stochastic Variable. Example—Normal Distributions.
Example—Cauchy’s Distribution. Example—The Radium Atom. Expectations.
Example—Balls in an Urn. Example—The Matching Problem. Variance.
Example—Bernoullian Trials. Covariance and Correlation. Normal
Correlation. References for Further Study. Problems

7. ...
The Beta and Gamma Functions. The O, o Notation. Stirling’s Theorem.
Complex Exponentials. The Integral of Sin x/x. References for Further
Study. Problems

8. LIMIT THEOREMS
The Central Limit Law. The Poisson Distribution. The Laws of Large
Numbers. The Law of the Iterated Logarithm. References for Further
Study. Problems

9. ...
... Discussion. ... Example—Cauchy’s Distribution. ... Detailed Proof.
References for Further Study. ...

10. THE POISSON DISTRIBUTION
The Binomial Limit. Approximation to Bernoullian Probabilities.
Comparison with the Normal Law. Example—Telephone Trunk ...
Example—Counting Bacteria. Exact Poisson Distributions. ... Two Views
of the Same Problem. ... Problems

11. THE LAWS OF LARGE NUMBERS
Proof of the Weak Law. Convergence in Probability. Application to
Sampling. Example—A Numerical Problem. The a Posteriori Approach.
Proof of the Strong Law. Example—Decimal Expansions. References for
Further Study. Problems

INDEX


CHAPTER 1 
PERMUTATIONS AND COMBINATIONS 


This chapter contains only a very brief summary of the subject of 
permutations and combinations. It is designed as a review rather than 
as an introduction to the subject. For the student who finds this 
chapter inadequate, we have listed at the end several references in 
which he will find a much fuller treatment. In addition to these refer- 
ences, we should also recommend to the student the book from which 
he studied college algebra. 

We take up permutations and combinations, not because these 
topics form an integral part of the theory of probability, but because
they are useful tools in solving many problems that illustrate the
fundamental ideas of probability theory. The emphasis in later chapters
will be on these fundamental ideas, and the problems assigned there 
are designed to help the student understand these ideas. But if a 
specific example is to be of any help to the student in understanding a 
general principle, he must be able to work the example without wasting 
all his energy on mechanical details. The student who is weak in 
arithmetic will have trouble with principles of accounting; the one who 
is weak in French grammar will get lost trying to study French litera- 
ture. Similarly, the main body of this course is apt to elude the 
student who neglects the more or less mechanical aids such as those 
found in this chapter. 


1. Formulas for ${}^nP_r$ and ${}^nC_r$


Suppose there are n objects, distinguishable one from the other in 
some way. If r of these n objects are laid out in some order, we say 
that this procedure describes a permutation of the r objects. If we 
count all the different permutations thus obtainable from the entire set 
of n objects (permutations being considered different if they are 
described by different sets of r objects or if they are described by
different methods of ordering the same r objects), we refer to the result
as the “number of permutations of n things r at a time” and denote it by
the symbol ${}^nP_r$.

The formula for ${}^nP_r$ in terms of n and r may be obtained by the
following device: Let us think of a permutation as described by taking a
box with r compartments (numbered from 1 to r) and filling it by placing
one of the n objects in each compartment. Now, there are obviously n
different ways of filling the first compartment. For each of these n
ways, there are n − 1 ways of filling the second compartment. Hence,
there are n(n − 1) different ways (different in the sense that they will
lead to different permutations) of filling the first two compartments.
Continuing in this manner, we arrive at the familiar formula:

$${}^nP_r = n(n-1)(n-2) \cdots (n-r+1).$$


A neater way of expressing this formula is obtained by means of a 
device the student should learn to use. The expression for ${}^nP_r$ is
very similar to n!. The only trouble is that the first n − r factors are
missing. If we interpret “missing” to mean they have been divided out,
we have

$${}^nP_r = \frac{n!}{(n-r)!}. \tag{1}$$


We describe a combination of n things r at a time by choosing a set
of r objects from the parent set of n objects and disregarding the order
(if any) of those chosen. Thus, for every combination of n things
r at a time there are ${}^rP_r = r!$ permutations. Hence, the number of
permutations is r! times the number of combinations. In symbols:

$${}^nP_r = r!\,{}^nC_r.$$


So, we divide (1) by r! and obtain the formula for the number of
combinations of n things r at a time:

$${}^nC_r = \frac{n!}{r!\,(n-r)!}. \tag{2}$$
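The two counting formulas can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names `n_p_r` and `n_c_r` are illustrative choices, and the standard-library functions `math.perm` and `math.comb` are used only as an independent check.

```python
from math import comb, factorial, perm


def n_p_r(n: int, r: int) -> int:
    """Count permutations by filling r compartments: n(n-1)...(n-r+1)."""
    count = 1
    for k in range(r):
        count *= n - k  # n - k objects remain for compartment k + 1
    return count


def n_c_r(n: int, r: int) -> int:
    """Count combinations by dividing out the r! orderings of each set."""
    return n_p_r(n, r) // factorial(r)


# Formula (1): nPr = n! / (n - r)!
assert n_p_r(10, 3) == factorial(10) // factorial(7) == perm(10, 3)
# Formula (2): nCr = n! / (r! (n - r)!)
assert n_c_r(10, 3) == factorial(10) // (factorial(3) * factorial(7)) == comb(10, 3)
print(n_p_r(10, 3), n_c_r(10, 3))  # 720 120
```

The loop mirrors the compartment argument directly, so the agreement with the factorial form is a check on the "divided out" device rather than a restatement of it.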

The combination numbers ${}^nC_r$ appear in many different connections
in mathematics, and unfortunately there is no universally accepted
symbol for them. We shall stick to the symbol ${}^nC_r$. Others in
common use are $\binom{n}{r}$, $C^n_r$, ${}_nC_r$, $C(n,r)$ and
$C_{n,r}$. Uspensky’s Introduction to Mathematical Probability, which we
recommend again and again as collateral reading, has the r and n
reversed relative to our symbol ${}^nC_r$. Because of this wide
diversity of notation, the student would do well to check carefully on
the author’s convention in each new work he reads.


An interesting arrangement of the numbers ${}^nC_r$ is given by Pascal’s
triangle:

                  1
                1   1
              1   2   1
            1   3   3   1
          1   4   6   4   1
        1   5  10  10   5   1
      1   6  15  20  15   6   1


Successive rows represent successive values of n, and in each row the
entries read from left to right are the values of ${}^nC_r$ given by
successive values of r. We have listed above the entries for
n = 0, 1, 2, ..., 6. An interesting feature of the Pascal triangle is
that each entry is the sum of the two above it (see Prob. 18 at the end
of this chapter). In many cases this rule furnishes an easy way of
computing combination numbers.
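The sum rule for Pascal's triangle translates into a short program. This is an illustrative Python sketch; the function name `pascal_rows` is a choice made here, and each row is compared against `math.comb` only as a check.

```python
from math import comb


def pascal_rows(n_max: int) -> list:
    """Rows n = 0..n_max, each built from the row above by the sum rule."""
    rows = [[1]]
    for _ in range(n_max):
        prev = rows[-1]
        # Interior entries are sums of the two entries above; the ends are 1.
        interior = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + interior + [1])
    return rows


for n, row in enumerate(pascal_rows(6)):
    # Each row agrees with the combination numbers for that n.
    assert row == [comb(n, r) for r in range(n + 1)]
    print(row)
```

Only additions are needed, which is why the rule is often the easiest way to tabulate combination numbers for small n.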

The student will notice that each row of the triangle begins with an
entry corresponding to r = 0. Furthermore, for each n, we have set
${}^nC_0 = 1$. This is quite consistent with the formula. If we note that
(n − 1)! = n!/n, then it seems reasonable to say that 0! = 1!/1 = 1;
and (2) yields

$${}^nC_0 = \frac{n!}{0!\,n!} = 1.$$

The inclusion of ${}^nC_0$ obviously makes for symmetry in Pascal’s triangle,
but we might question its significance in terms of combinations. It 
counts the so-called vacuous combination—the process of making a pass 
at the set of n objects, but not taking any. In some problems we shall 
want to count this possibility; in others we shall want to exclude it. 
The student must learn to distinguish between combinations and 
permutations. The distinction lies in whether or not the order of 
objects chosen is to be considered. For example, there are 24 per- 
mutations of the numbers 1, 2, 3, 4 taken three at a time: 


123   124   134   234
132   142   143   243
213   214   314   324
231   241   341   342
312   412   413   423
321   421   431   432

However, there are only 4 combinations, one indicated by each
column above. All entries in a given column are merely rearrange- 
ments one of the other, and these do not constitute new combinations. 
Sometimes textbook problems “In how many ways can so and so be 
done?” leave room for doubt as to whether combinations or permuta- 
tions are to be counted. For instance, we have given four answers to 
Prob. 6 at the end of this chapter. Each represents a different
interpretation of the problem. Some of the interpretations are rather
artificial, but each is theoretically justifiable, and the student should
determine what interpretation is used to arrive at each answer.
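The 24 permutations and 4 combinations of the numbers 1, 2, 3, 4 taken three at a time can be enumerated mechanically. The sketch below uses Python's standard `itertools` module; the grouping step is an illustration of the "one combination per column" remark, not something taken from the text.

```python
from itertools import combinations, permutations

numbers = [1, 2, 3, 4]
perms = list(permutations(numbers, 3))  # ordered selections
combs = list(combinations(numbers, 3))  # unordered selections

print(len(perms), len(combs))  # 24 4

# Group the permutations by the set of numbers they use:
# each group corresponds to one column of the table, i.e. one combination.
groups = {}
for p in perms:
    groups.setdefault(frozenset(p), []).append(p)

assert len(groups) == len(combs)
assert all(len(g) == 6 for g in groups.values())  # 3! orderings of each set
```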

To anticipate for a moment, probability will be computed in many 
simple examples by counting “favorable cases” and “possible cases”
and dividing the former by the latter. The cases under consideration 
will frequently be combinations or permutations, and often we can 
analyze the problem in either way. In such problems we must take 
care to count the same thing in both numerator and denominator. 

Occasionally, the student will want to use the argument we employed
to find ${}^nP_r$. That is, he will reason that, if a first step can be
taken in $n_1$ ways and for each of these a second step can be taken in
$n_2$ ways, then the two steps can be taken in $n_1 n_2$ ways. In
general, this type of argument leads to a count of permutations.
However, there are exceptions to this rule. The above argument will lead
to a count of combinations in case the results of the two steps are not
permutable, i.e., in case the two steps are of an essentially different
nature so that no combinations are counted twice.


2. Example—Poker Hands 


Perhaps an illustrative example will serve to clarify this last remark.
Let us find the number of 5-card poker hands of each of the following
types:

(a) One pair (other 3 different)
(b) Two pairs (1 odd card)
(c) Three of a kind (2 odd cards)
(d) Full house (3 of a kind and a pair)
(e) Four of a kind

Here, presumably, we are to count 5-card combinations. At any
rate, let us proceed to do that.

(a) There are 52 ways of choosing the first card. For each of these
there are 3 ways of completing the pair. Now, if we say there are
52 × 3 ways of getting the pair, we shall be counting every combination
twice. For instance, the ace of spades is one of the 52 possibilities
for the first card; and the ace of hearts is one of the 3 possibilities that
goes with it. However, the ace of hearts is one of the 52 possibilities
for the first card; and the ace of spades is one of the 3 that goes with it.
Thus, in the 52 × 3 ways of getting a pair, we have counted (AS,AH)
and (AH,AS); but these 2 ways lead to the same pair. The same is
true of every other pair; so the number of 2-card combinations
containing pairs is 52 × 3/2. Now, to fill the rest of the hand: For
every pair there are 48 ways of getting a third card which is different.
For every such choice of the first 3 cards, there are 44 allowable
choices for the fourth. Similarly, we are left with 40 choices for the
fifth card. Again, straight multiplication yields a count of the
permutations of the last 3 cards; so it is necessary to divide by 3!.
Now, there arises the following question: There are 52 × 3/2 two-card
combinations which are pairs; for each of these there are
48 × 44 × 40/3! combinations of 3 odd cards. If we multiply these two
numbers together, do we not again count permutations? The answer is no,
because these things do not permute. We count no 5-card combinations
twice by listing the pair first and the 3 odd cards second. Thus, the
answer to part (a) is:

$$\frac{52 \times 3}{2!} \cdot \frac{48 \times 44 \times 40}{3!}.$$


(b) Here we count the pairs in the same way. There are 52 × 3/2
ways of getting the first pair. For each of these, there are 48 × 3/2
ways of getting the second pair. But these pairs permute. To multiply
these two numbers together and quit would be to count aces and
kings and also kings and aces. The result must be divided by 2 again.
Finally, for every 2-pair combination there are 44 ways of choosing the
odd card. The odd card does not permute with the pairs; so
multiplication by 44 yields the total count of 5-card combinations
containing 2 pairs. The answer is:

$$\frac{1}{2!} \cdot \frac{52 \times 3}{2!} \cdot \frac{48 \times 3}{2!} \cdot 44.$$

These two cases illustrate all the difficulties that arise. We leave
the remainder of the example as an exercise for the student.
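The two counts worked out above can be evaluated and cross-checked. In this Python sketch the first computation follows the card-by-card argument of the text; the second ("choose ranks first, then suits") is an alternative counting scheme introduced here purely as a check and is not the method of the text.

```python
from math import comb, factorial

# (a) One pair, counted card by card as in the text.
pairs = 52 * 3 // factorial(2)            # 78 two-card pairs
odd_cards = 48 * 44 * 40 // factorial(3)  # 3 odd cards, order divided out
one_pair = pairs * odd_cards

# (b) Two pairs: the two pairs permute, so divide by 2! once more.
second_pairs = 48 * 3 // factorial(2)
two_pairs = pairs * second_pairs // factorial(2) * 44

# Cross-check (assumed alternative scheme): choose ranks, then suits.
one_pair_check = 13 * comb(4, 2) * comb(12, 3) * 4**3
two_pairs_check = comb(13, 2) * comb(4, 2) ** 2 * 44

assert one_pair == one_pair_check
assert two_pairs == two_pairs_check
print(one_pair, two_pairs)  # 1098240 123552
```

Both routes divide out exactly the orderings that permute, which is the point of the example: the pair and the odd cards are of different kinds and need no further division.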


3. Binomial Coefficients 


So much for permutations and combinations as such. Before moving on,
we should look at some of the other uses of the combination numbers
${}^nC_r$. Perhaps their most important appearance is in the
binomial theorem:

$$(a+b)^n = \sum_{r=0}^{n} {}^nC_r\,a^r b^{n-r}. \tag{1}$$


Because of the role they play in this important theorem, the numbers
${}^nC_r$ are frequently referred to as binomial coefficients. In fact
the student will find that this is the name usually given to these
numbers.

The binomial formula (1) is an identity in a and b. That is, if a and
b are considered as independent variables, the function on the left is
equal to the function on the right at every point in the ab plane.
Clearly, if each side of such an identity is added to or multiplied by the
same quantity, the result is another identity. Furthermore,
differentiation with respect to either of the variables yields another
identity. (Note that integration may introduce an extraneous constant.)

Since (1)—or any identity derived from it—holds over the entire 
ab plane, it holds at any given point or along any given curve. Thus, 
the substitution of specific values for a and b or substitution from an 
equation relating a and b will yield a correct result. To illustrate the 


use of these techniques for getting information from (1), let us set
a = b = 1. The result is

$$\sum_{r=0}^{n} {}^nC_r = 2^n. \tag{2}$$

This says (among other things) that the number of all combinations of
n things (including the vacuous one) is $2^n$.
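If we take the partial derivative of (1) with respect to a, we have

$$n(a+b)^{n-1} = \sum_{r=0}^{n} r\,{}^nC_r\,a^{r-1} b^{n-r}.$$

Multiply by a:

$$na(a+b)^{n-1} = \sum_{r=0}^{n} r\,{}^nC_r\,a^r b^{n-r};$$

and consider the special case a + b = 1:

$$\sum_{r=0}^{n} r\,{}^nC_r\,a^r (1-a)^{n-r} = na. \tag{3}$$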


As we shall see later, this result proves useful. The method used to
obtain it is commonly employed in making computations with binomial
coefficients.
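The results obtained from the binomial theorem by substitution and differentiation can be spot-checked with exact arithmetic. The Python sketch below is illustrative only; the helper `binomial_sum` and the sample values n = 12, a = 3/10 are choices made here, not taken from the text.

```python
from fractions import Fraction
from math import comb


def binomial_sum(n, a, weight=lambda r: 1):
    """Sum of weight(r) * nCr * a^r * (1 - a)^(n - r) for r = 0..n."""
    b = 1 - a
    return sum(weight(r) * comb(n, r) * a**r * b ** (n - r) for r in range(n + 1))


n = 12
a = Fraction(3, 10)  # exact rational arithmetic avoids rounding error

# Setting a = b = 1 in the binomial theorem: the combination numbers sum to 2^n.
assert sum(comb(n, r) for r in range(n + 1)) == 2**n
# Differentiate, multiply by a, then set a + b = 1: the weighted sum is na.
assert binomial_sum(n, a, weight=lambda r: r) == n * a
# With weight 1 the terms sum to (a + b)^n = 1.
assert binomial_sum(n, a) == 1
```

A numerical check at one point is not a proof of the identity, but it is a quick guard against sign or index slips when deriving new identities by the same method.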


REFERENCES FOR FURTHER STUDY 
Hall and Knight, Higher Algebra, London (1936), Chaps. XI, XII, XIII.
Levy and Roth, Elements of Probability, Oxford (1936), Chaps. III, VII.
Whitworth, Choice and Chance, New York (1927), Chaps. I, II, III.


PROBLEMS 


1. How many Greek-letter fraternities can be given distinct names
of 3 different letters each? (There are 24 letters in the Greek
alphabet.) Ans. ${}^{24}P_3$.

2. The same as Prob. 1, but the letters in a name need not be
different. Ans. $24^3$.

3. How many fraternity names are there with either 2 or 3 letters,
all different? with repetitions? Ans. ${}^{24}P_2 + {}^{24}P_3$; $24^2 + 24^3$.

4. In how many ways may a party of 10 people be seated in a row?
at a round table? Ans. 10!, 9!.

5. In how many ways may a party of 5 couples be seated at a round
table with men and women alternating? Ans. 5! 4!.

6. In how many ways can 52 cards be dealt into 4 hands of 13 cards
each? Ans. $52!$, $52!/(13!)^4$, $52!/4!$, $52!/[4!\,(13!)^4]$.

7. How many different 13-card hands can be obtained from a
52-card deck? Ans. ${}^{52}C_{13}$.

8. How many of these 13-card hands will contain a 7-card suit?
Ans. $4 \cdot {}^{13}C_7 \cdot {}^{39}C_6$.

9. How many 5-card poker hands contain three of a kind, a full
house, four of a kind? (See Sec. 2.) Ans. 54,912; 3,744; 624.

10. How many integers less than a billion contain five 7’s?
Ans. ${}^9C_5 \cdot 9^4$.

11. How many integers less than a billion consist of 1’s and 2’s only?
Ans. $2^{10} - 2$.

12. How many of the integers in Prob. 11 contain three 1’s?
Ans. 210.

13. How many different ordered pairs of numbers can result from
the throwing of two dice? Ans. 36.

14. How many of the results in Prob. 13 give a total of 7? 11?
Ans. 6, 2.

15. How many 5-man basketball teams can be chosen from a squad
of 12? Ans. 792.

16. If the 12-man squad in Prob. 15 consists of 3 centers, 5 forwards,
and 4 guards, how many teams (1 center, 2 forwards, 2 guards) can be
chosen? Ans. 180.
17. Show that ${}^nC_r = {}^nC_{n-r}$.

18. Show that ${}^{n+1}C_r = {}^nC_r + {}^nC_{r-1}$.

19. Use the identity in Prob. 18 to prove the binomial theorem by
mathematical induction.

20. Show that

$$\sum_{r=0}^{n} (-1)^r\,{}^nC_r = 0.$$

21. Show that

$$\sum_{r=0}^{n} \frac{1}{r!\,(n-r)!} = \frac{2^n}{n!}.$$

22. Show that

$$\sum_{r=0}^{n} r^2\,{}^nC_r\,a^r (1-a)^{n-r} = na(1-a) + (na)^2.$$

23. Show that, for 0 < k < n,

$$\sum_{r=0}^{k} {}^nC_r\,{}^nC_{k-r} = {}^{2n}C_k.$$

Hint: $(1+x)^n (1+x)^n = (1+x)^{2n}$. Compute the coefficient of
$x^k$ on each side of this equation.

24. Show that

$$\sum_{r=0}^{n} ({}^nC_r)^2 = {}^{2n}C_n.$$

25. Show that, for 0 < 2k ≤ n,

$$\sum_{r=0}^{2k} (-1)^r\,{}^nC_r\,{}^nC_{2k-r} = (-1)^k\,{}^nC_k.$$

Hint: $(1+x)^n (1-x)^n = (1-x^2)^n$.
26. Show that (except, perhaps, for an additive constant) 
k 
p Car — ar C aC [iea = pra. 
Hint: 


Differentiate with respect to x 




27. Show that (except, perhaps, for an additive constant) 
k 
1 
X Oal — a) = (n — K) "Or fea- oma. 


r=0 


Note: The constant of integration left unaccounted for in Probs. 26 
and 27 is actually zero in each case. See Prob. 12, Chap. 7. 


CHAPTER 2 
MATHEMATICAL PROBABILITY 


To the pure mathematician, probability is merely a function satisfying
certain axioms. Later in this chapter we shall state the axioms that
will serve as a basis for the development of the mathematical
discipline known as the “theory of probability.” First, however, we
should give some attention to the question of the physical
significance of probability.


4. An Elementary Definition 


Probability might be described as a “measure of likelihood.” That
is, the probability of a physical event will be a number which describes,
in accordance with certain fixed conventions, the likelihood of
occurrence of the event.

For any such numerical measurements we have to agree on a scale,
and the standard convention is that probabilities range from 0 to 1
with impossible events assigned probability 0 and logically certain
events assigned probability 1. Furthermore, we want the intermediate
probabilities assigned in such a way that, the more likely an event is,
the greater will be its probability.

This last result can be accomplished in many ways, and we might be
tempted to be more explicit by saying that we want probability
proportional to likelihood. However, likelihood is too abstract a concept
for this to make much sense. We get a more satisfactory picture by
considering the frequency of occurrence of an event. Let us say that
the probability of an event will be proportional to the frequency with
which we should expect the event to occur. Clearly, the factor of
proportionality (to give probabilities ranging from 0 to 1) is the
reciprocal of the total number of opportunities for occurrence of the
event. So we might say that the probability of an event is its expected
relative frequency.
The student must not confuse what we have called “expected relative
frequency” with observed frequencies of occurrence (past, present,
or future). The question of the relationship between probability and
experimental frequency is one that can be discussed more intelligently
at the end of a probability course than at the beginning. We shall
return to it in Chap. 11. All we want to point out here is that if there
is any convincing reason for expecting a certain event to happen with
a certain relative frequency, then it is that relative frequency (not its
square, for instance) that we should take as the probability of the given
event.

The usual procedure in determining the proper assignment of prob- 
abilities to physical events is to begin with the simplest cases, where the 
proper assignment is obvious, and develop rules and formulas to 
describe more complicated cases in terms of these simpler ones. 
Accordingly, we begin with a definition that covers the simplest type 
of physical situation. 


Definition 1. If an experiment can produce n different results all 
of which are equally likely and if r of these results are defined as favor- 
able, the probability of a favorable result is r/n. 
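
Definition 1 amounts to a ratio of two counts. A minimal Python sketch follows; the function name `probability` and the fair six-sided die used as the experiment are assumptions made here for illustration.

```python
from fractions import Fraction


def probability(favorable, outcomes):
    """Definition 1: r favorable results out of n equally likely ones."""
    outcomes = set(outcomes)
    favorable = set(favorable) & outcomes  # only possible results can count
    return Fraction(len(favorable), len(outcomes))


die = range(1, 7)  # assumption: the six faces are equally likely
print(probability({2, 4, 6}, die))  # 1/2
print(probability({5, 6}, die))     # 1/3
```

The equal likelihood of the outcomes is an input to the function, not something it can verify.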


Now, equally likely results have the same expected relative fre- 
quencies. It does not matter whether we regard this as an obvious 
statement, a definition of equal likelihood, or a definition of expected 
frequency. The point is that because of this fact (or definition, or 
what have you) Definition 1 is consistent with the idea that probability 
be proportional to expected frequency. 

While Definition 1 is inadequate as a basis for the development of 
the mathematical theory of probability, it is essential that we have 
something of the sort to relate the notion of probability to that of likeli- 
hood. The theorems and formulas of mathematical probability theory 
are best developed from a set of axioms (see Sec. 8), and that is the way
we shall proceed. It would be very unsatisfactory, however, to have a 
purely abstract theory with no relation to physical events at all; and 
Definition 1 (and its twin brother Definition 3—see Sec. 12) serve as 
important connecting links between the mathematical theory and the 
physical world. Furthermore, these physical definitions (wherever 
they apply) give probability the physical meaning we should like it to 
have: a measure of likelihood.

We are still faced with the (essentially psychological) problem of 
determining when events are equally likely. We have suggested 
equally likely events as the simplest physical situation because they 
are the easiest to spot with a reasonable degree of assurance.
Fundamentally, what we do when we apply Definition 1 is to assume the
equal likelihood of certain events. We should try to make these
assumptions reasonable, but logically they have to stand as assump- 
tions only. There is no way of proving them. 

The necessity for making assumptions in order to fit a physical 
problem to a mathematical formula is not peculiar to the calculus of 
probabilities. In almost any calculus book one can find the problem,
“Find the work done in filling an upright cylindrical tank a feet in
radius and h feet in height by pumping water in through an inlet in the
bottom.” At this point, one blithely writes

$$W = 62.4 \int_0^h \pi a^2 x \, dx.$$

Now no physical tank is a perfect cylinder; the inlet presents an
additional irregularity; water is not quite a perfect fluid; etc.
However, if we assume that none of these irregularities exists, then the
mathematical theory of the definite integral guarantees that the above is the
correct answer. 

Probability is not alone among the branches of mathematics in that 
assumptions are necessary for its applications to the physical world. 
Its only claim to fame in this respect is that in probability problems 


the experts disagree more violently and argue more loudly about the 
assumptions to be made. 


5. Example—Roulette 


As an example of the type of situation described in Definition 1, let
us study the game of roulette. There are minor variations here and
there, but the standard game (at Monte Carlo) is played with a wheel
with 37 equally spaced slots, numbered from 0 through 36. In addition
to a number, each slot has a color. The zero is green, and the
others are either red or black (18 of each). Actually, the colors
alternate around the wheel, but the numbers are irregularly placed. The
numbers 1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36
are red and the others are black. The wheel is spun, and a ball is
rolled around it. When the wheel and ball have slowed down sufficiently,
the ball falls into a slot; and bets are then paid off on the basis of the
slot into which it falls.

The educational feature of the game from our point of view is the
way in which bets may be combined. There is a board on which the
numbers are arranged as follows:


          0
  _1_     2    _3_
   4     _5_    6
  _7_     8    _9_
  10     11   _12_
  13    _14_   15
 _16_    17   _18_
 _19_    20   _21_
  22    _23_   24
 _25_    26   _27_
  28     29   _30_
  31    _32_   33
 _34_    35   _36_

The colors are indicated on the board too. We have done this by 
underlining the red numbers. Bets are made by placing chips at an 
appropriate place on the board. One may bet on a single number, any 
2 adjacent numbers (either across or down, not diagonally), any square 
of 4 numbers, a combination of zero with any 1 or with all 3 of the num- 
bers adjacent to it, any row of 3 numbers, any 2 adjacent rows, or any 
column of 12 numbers. (The zero is not considered as being in the 
middle column.) In addition to the above number array on which 
these bets may be indicated, the board provides spaces on which one 
may indicate bets on the numbers 1 to 18 (manque), 19 to 36 (passe),
the even numbers, the odd numbers, the red numbers, the black num- 
bers, and each of the three dozens 1 to 12, 13 to 24, 25 to 36. 

It would seem reasonable to assume that the 37 numbers are equally 
likely (there is no way of proving this mathematically); therefore we 
may apply Definition 1 to find the probability of winning any particu- 
lar bet. The point we want to stress here is that the description of the 
game furnishes us with our situation (37 numbers) and our law of 
probability (1/37 times the number of numbers covered by the bet).
The definition of “favorable” (i.e., the particular bet we want to talk
about) has nothing to do with this. However, having described the
situation and formulated the law of probability, we can immediately
compute the probability of winning on any of the bets listed above.
The probability of winning on red is 18/37; on even, 18/37; on the
first column, 12/37; on the first two rows, 6/37; etc. There are 155
different bets covered by the discussion we have given; so, with a
single description of a situation and a probability law, we solve 155
different problems. As a matter of fact, from a purely theoretical point
of view, we solve a lot more than that. In Chap. 1 we saw that the
total number of combinations of n things, including the vacuous one,
is $2^n$. We now exclude the vacuous combination and see that there are
$2^{37} - 1$ = 137,438,953,471 different bets imaginable, though only 155
of these are accepted at Monte Carlo. Clearly, each of these
hundred-odd billion problems is solvable immediately from the law of
probability we have formulated.
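
The law of probability for roulette (1/37 times the number of numbers covered) is easy to apply mechanically. The Python sketch below is illustrative; the names `SLOTS`, `RED` and `p_win` are choices made here, the red numbers are those of the standard single-zero wheel, and columns and rows are read off the betting board described above.

```python
from fractions import Fraction

SLOTS = set(range(37))  # 0 through 36, assumed equally likely
RED = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}


def p_win(covered):
    """Law of probability: 1/37 times the number of numbers covered."""
    return Fraction(len(set(covered) & SLOTS), len(SLOTS))


even = {k for k in range(1, 37) if k % 2 == 0}  # zero is not counted as even
first_column = set(range(1, 37, 3))             # 1, 4, 7, ..., 34
first_two_rows = set(range(1, 7))               # 1 through 6

print(p_win(RED), p_win(even), p_win(first_column), p_win(first_two_rows))
# 18/37 18/37 12/37 6/37

# Every non-empty subset of the 37 numbers is an imaginable bet.
print(2**37 - 1)  # 137438953471
```

The function takes any set of numbers, so the same few lines answer every one of the imaginable bets, accepted at Monte Carlo or not.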

All this is typical of probability problems in general. The mathe- 
matical formulation should lead to a law which gives the probability 
of every event connected with the given situation. Usually not all 
this information will be needed in any given study; nevertheless, it 
will be available if the problem has been correctly formulated. 


6. The Addition Principle 


Suppose, in the roulette game, we place two bets on the same spin of 
the wheel. Suppose, further, that no number is covered by both bets. 
For example, we might bet on the first and third columns, not on the 
first column and the second row. This latter combination covers the 
4 twice. Under these circumstances, it is easily seen that the prob- 
ability of winning one or the other of the two bets is equal to the sum 
of the probabilities of winning each of the two separately.

This addition principle is a property of probabilities in general. In
fact, for situations covered by Definition 1, it is almost obvious. If
$E_1$ is an event consisting of $r_1$ results and $E_2$ is an event
consisting of $r_2$ results from the same experiment and if no one of the
n possible results is contained in both events, then the two events
together consist of $r_1 + r_2$ results. So, by Definition 1, the
probability of one or the other is

$$\frac{r_1 + r_2}{n} = \frac{r_1}{n} + \frac{r_2}{n};$$

but the right-hand side is the sum of the probabilities of the separate
events.

7. Example—Two Dice


Before proceeding to formal mathematical definitions, let us consider
another specific example: the problem of the total thrown on two dice.

We might begin by saying that the situation is described by noting 
that there are 11 possible results: totals ranging from 2 to 12, inclusive. 
This is a legitimate description of the situation in that any event
concerned with totals (for example, the total is 2, the total is 2 or 9, the
total is odd, etc.) will be some combination of these 11 results. However,
Definition 1 is not directly applicable because we can find no
legitimate excuse for calling these results equally likely; but we can
use Definition 1 and the addition principle to formulate a law of
probability to go with this 11-result situation.

On breaking the problem down further, we find results of a different
kind which seem to be equally likely. These are the ordered pairs of 
numbers indicating the results on the individual dice. There are 36 
of these, and they may be arranged in a very suggestive manner as 
follows: 


(1,1) 

(1,2), (2,1) 

(1,3), (2,2), (3,1) 

(1,4), (2,3), (3,2), (4,1) 

(1,5), (2,4), (3,3), (4,2), (5,1) 
(1,6), (2,5), (3,4), (4,3), (5,2), (6,1) 
(2,6), (3,5), (4,4), (5,3), (6,2) 
(3,6), (4,5), (5,4), (6,3) 
(4,6), (5,5), (6,4) 

(5,6), (6,5) 

(6,6) 


From this array, it is clear that Definition 1 gives us the following 
results: 


Total:        2     3     4     5     6     7     8     9     10    11    12
Probability:  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36


Now the addition principle gives us a law of probability applicable to 
all events admissible under the 11-result situation: To find the proba- 
bility that the total is any one of a set of numbers taken from the set 2, 
3, ... , 12, add the probabilities given by the above table for the 
individual numbers in the set under consideration. Using this law, 
we find, for example, that the probability that the total is 5 or less is 


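$$\tfrac{1}{36} + \tfrac{2}{36} + \tfrac{3}{36} + \tfrac{4}{36} = \tfrac{5}{18}.$$

The probability that the total is odd is

$$\tfrac{2}{36} + \tfrac{4}{36} + \tfrac{6}{36} + \tfrac{4}{36} + \tfrac{2}{36} = \tfrac{1}{2}.$$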
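The table and the two sums above can be reproduced mechanically. The sketch below, in Python, counts the 36 ordered pairs (taken as equally likely, as in the text) and then applies the addition principle; the helper name `pr_total_in` is our own.

```python
from fractions import Fraction
from itertools import product

# The 36 ordered pairs are taken as equally likely (Definition 1).
pairs = list(product(range(1, 7), repeat=2))

# Probability of each total 2..12: count the pairs with that total, divide by 36.
table = {t: Fraction(sum(1 for a, b in pairs if a + b == t), 36)
         for t in range(2, 13)}

def pr_total_in(totals):
    """Addition principle: add the table entries for the totals in the set."""
    return sum((table[t] for t in totals), Fraction(0))

print(pr_total_in(range(2, 6)))       # the total is 5 or less
print(pr_total_in(range(3, 13, 2)))   # the total is odd
```

Exact fractions are used so that the results can be compared directly with the hand computation.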




Probably the most important observation to be made about
this example is that the general law of probability came directly from
the table of probabilities for the individual numbers. In this instance the
table was obtained by using Definition 1; but, no matter how we
arrived at such a table, it would still lead directly to a general
formulation of the problem.


8. The Axiomatic Definition 


In discussing roulette and dice we have chosen particular points for
emphasis in order to call attention to the things that one
expects of a law of probability. We might summarize these ideas as
follows: As a basis for a discussion of probability there must be a
fundamental set of “things” (numbers, number pairs, “results,” or what
have you). Every subset of this fundamental set will be called an
event. Then, there should be a law of probability (usually called the
probability distribution function or, frequently, simply the distribution
function) which defines probability for each event. The addition
principle should hold for this distribution function. Finally,
the probability of each event should be a number between 0 and 1, and
the probability of the entire fundamental set should be 1.

These properties of the distribution function are the ones ordinarily
used to give a purely abstract, axiomatic definition of mathematical
probability. Let us formulate them a little more precisely in the
language of pure mathematics.

Let us think of the fundamental set of results as a set of points. In
other words, let us construct a mathematical model of the physical
situation in the form of a point set, each point of which represents one
of the possible results of the experiment. The set of all such points to
be considered in any given problem we shall call the event space. We
shall designate this event space by S.
In the examples we have seen so far, the event space S has been a
finite point set. In the roulette problem S contained 37 points; in
the dice problem, 11 points. Indeed, in any problem directly covered
by Definition 1, S will contain exactly n points. However, there is
no need to restrict our purely mathematical model in this fashion; and
it is here that we begin to transcend the elementary physical definition.
Let us say, then, that an event space is any point set, finite or infinite.
The next step will be to define events as subsets of S; but, in case the
event space has infinitely many points, not every subset can be admitted
as an event. In the terminology of modern integration
theory, it is only the measurable subsets of S that we want to consider.
This is no place to discuss the measurability of point sets. Instead, 
let us content ourselves with the following comments: The space S may 
contain: 


(a) A finite number of points 

(b) An infinite sequence of points—for example, the set of integer 
points on the real line 

(c) A continuum of points—for example, the set of all points on a 
line or the set of all points in some line interval 


In the first two cases, S is called discrete. In a discrete space all 
subsets are measurable. In the continuous case this is no longer true, 
but a nonmeasurable set is a very weird sort of thing—indeed, so 
unusual that it is of consequence only in advanced theoretical consider- 
ations in probability. 

To return to our main line of discussion, we define an event as a
measurable subset of S. Two events, $E_1$ and $E_2$, will be called mutually
exclusive if they have no points in common. A (finite or infinite)
collection of events will be called a collection of mutually exclusive events
if no two of the events have a point in common. The complement of
$E$, denoted by $\bar{E}$, is the set of all points of S not contained in $E$. If
$E_1$ and $E_2$ are two point sets, we shall denote by $E_1 + E_2$ the set of all
points belonging to either $E_1$ or $E_2$ (or both); we shall denote by $E_1E_2$
the set of points belonging to both $E_1$ and $E_2$. Naturally enough, the
sets $E_1 + E_2$ and $E_1E_2$ are called the sum and product, respectively, of
$E_1$ and $E_2$. We shall use $\Sigma E_n$ and $\Pi E_n$ for the sum and product of any
(finite or infinite) number of sets. The student should hardly need to
be cautioned that addition and multiplication symbols may be applied
to both sets and numbers in the course of a single formula and that
their meaning depends on the type of thing to which they are applied.

[Fig. 1: sketch of the sets $E_1$, $E_2$, and $E_1 + E_2$.]

At this stage, it might be a good idea to draw a parallel between the
terminology of events considered as point sets in event space and that
of events considered as results of a physical experiment: 


Point Sets in Event Space                    Results of a Physical Experiment

E is vacuous (E = 0)                         E is logically impossible
E = S                                        E must occur
$E_1 + E_2$                                  The result is either $E_1$ or $E_2$
$E_1E_2$                                     The result is both $E_1$ and $E_2$
$E_1$ and $E_2$ are mutually exclusive       $E_1$ and $E_2$ are logically incompatible.
  ($E_1E_2 = 0$)                               If one occurs, the other does not
$\bar{E}$                                    E does not occur
$E_2$ is a subset of $E_1$                   If $E_2$ occurs, so does $E_1$. $E_2$ implies $E_1$


Axiomatic Definition of Mathematical Probability. Given an event
space S, pr{E} will be called a probability distribution function for
S if it satisfies the following axioms:


A. $\mathrm{pr}\{E\} \ge 0$ for every event E of S.
B. $\mathrm{pr}\{S\} = 1$.
C. If $E_1, E_2, E_3, \ldots$ is any (finite or infinite) sequence of mutually
exclusive events of S, then

$$\mathrm{pr}\{\Sigma E_n\} = \Sigma\,\mathrm{pr}\{E_n\}.$$


These axioms are, for the most part, a repetition of the informal
comments we made before, but in axiom C something new has been added.
This is the fact that the addition principle should hold for infinite as
well as finite collections of mutually exclusive events. So far as this
course is concerned, our most common use of axiom C will be as a
finite addition principle. Occasionally, however, we shall want to
apply it to infinite sequences of mutually exclusive events; and the
student should bear in mind that this extended addition principle is
essential to a completely satisfactory development of probability
theory.

For a given space S there are many ways of constructing a distribution
function pr{E} satisfying the axioms. For example, if S is the
set of integers 2, 3, ..., 12, we may assign a probability of 1/11 to
each point and use the addition principle to get a completely satisfactory
distribution function. On the other hand, we have already seen
that the assignment of probabilities we made in connection with the
dice game accomplishes the same thing in a different way. The
axioms merely tell us what a function must be like in order to be classed
as a distribution function; the construction of the function in a specific
problem will depend on the physical interpretation to be made of the
mathematical model. But more of this later; first, let us look at some
properties of the distribution function that are easily derived from the
axioms.
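The remark that one event space admits many distribution functions can be illustrated numerically. The following Python sketch builds two distribution functions on the set of totals 2, 3, ..., 12, one assigning 1/11 to every point and one using the dice probabilities, and checks the axioms for each. The check of axiom C is only a spot check on one pair of mutually exclusive events chosen by us; the function names are likewise our own.

```python
from fractions import Fraction

S = range(2, 13)

uniform = {t: Fraction(1, 11) for t in S}
dice = {t: Fraction(6 - abs(t - 7), 36) for t in S}   # 1/36, 2/36, ..., 6/36, ..., 1/36

def make_pr(point_probs):
    """Build pr{E} by adding point probabilities over the event E."""
    return lambda event: sum((point_probs[t] for t in set(event)), Fraction(0))

def satisfies_axioms(point_probs):
    pr = make_pr(point_probs)
    axiom_a = all(p >= 0 for p in point_probs.values())
    axiom_b = pr(S) == 1
    # Axiom C, spot-checked on one pair of mutually exclusive events.
    e1, e2 = {2, 3, 4}, {7, 11}
    axiom_c = pr(e1 | e2) == pr(e1) + pr(e2)
    return axiom_a and axiom_b and axiom_c

print(satisfies_axioms(uniform), satisfies_axioms(dice))
print(make_pr(uniform)({7}), make_pr(dice)({7}))
```

Both assignments pass, yet they give different probabilities to the same event, which is the point of the paragraph above.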


Theorem I. If S is an event space and pr{E} is a distribution
function for S, then


(a) $\mathrm{pr}\{\bar{E}\} = 1 - \mathrm{pr}\{E\}$ for every $E$ in S.

(b) The probability of the vacuous set is zero. ($\mathrm{pr}\{0\} = 0$.)

(c) For every $E$ in S, $0 \le \mathrm{pr}\{E\} \le 1$.

(d) If $E_2$ is a subset of $E_1$, $\mathrm{pr}\{E_1\} \ge \mathrm{pr}\{E_2\}$.

(e) For any collection $E_1, E_2, E_3, \ldots$ (mutually exclusive or not),
$\mathrm{pr}\{\Sigma E_n\} \le \Sigma\,\mathrm{pr}\{E_n\}$.


The first four of these are almost obvious, but the student should 
check for himself that they follow directly from the axioms. With 
regard to (e), we might remark that the proof depends on the fact that, 
if any of the sets have points in common, the probability of these com- 
mon points is counted only once on the left-hand side, but more than 
once on the right-hand side. For more about sums of arbitrary sets,
see Chap. 4. 

One or two of these results deserve further comment. Referring to 
our table of parallels in terminology, we see that (b) says that, if an 
event is logically impossible, its probability is zero. This is very com- 
forting, but it is important to note that the converse is not true. 
There are many examples (some of which we shall see very shortly) of 
events with probability zero which are by no means logically impossi- 
ble. Finally, we should note that a very useful way of stating (d) is: 
If $E_2$ implies $E_1$, then $\mathrm{pr}\{E_1\} \ge \mathrm{pr}\{E_2\}$.
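As a check on Theorem I, the properties can be verified on the two-dice totals. The Python sketch below uses the dice probabilities for the totals 2 to 12 and three events of our own choosing; it is an illustration on one example, not a proof.

```python
from fractions import Fraction

S = set(range(2, 13))
f = {t: Fraction(6 - abs(t - 7), 36) for t in S}   # dice probabilities for each total

def pr(event):
    """Add the point probabilities over the event."""
    return sum((f[t] for t in event), Fraction(0))

E1 = {t for t in S if t <= 8}      # "the total is 8 or less"
E2 = {t for t in S if t <= 5}      # "the total is 5 or less", a subset of E1
odd = {t for t in S if t % 2 == 1}

assert pr(S - E1) == 1 - pr(E1)            # (a) complement
assert pr(set()) == 0                      # (b) vacuous set
assert E2 <= E1 and pr(E1) >= pr(E2)       # (d) E2 implies E1
assert pr(E1 | odd) <= pr(E1) + pr(odd)    # (e) events with points in common
print(pr(E1 | odd), pr(E1) + pr(odd))
```

In (e) the right-hand side is larger because the odd totals below 8 are counted twice there.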


9. Stochastic Variables 


So much for the abstract description of mathematical probability. 
The question that now arises is: What terminology and notation can 
we use to describe specific distribution functions? Can we write down 
formulas for them? If so, in terms of what? 

A very useful notion in this connection is that of the stochastic
variable. The word “stochastic” comes from a Greek stem meaning
chance. The expressions “chance variable” and “random variable”
are also used by various authors as being synonymous with the expres- 
sion “stochastic variable.” The general idea is this: Instead of trying 
to describe a function pr{ E} over point sets, let us associate the points 
of the event space S with numbers and then describe the distribution 
function over sets of these numbers. Now when we say, “Associate
the points of S with numbers,” that is only another way of saying,
“Define a real-valued function over the points of S.” So the formal
definition of a stochastic variable is very simple. 


Definition 2. If S is an event space with a distribution function
attached and if x is a real-valued function defined over the points of S
(i.e., to each point of S there corresponds a value of x), then x will be
called a stochastic variable.


Now, the distribution function for S automatically defines the
probability of each value or set of values for x. For instance, pr{x = a}
is the probability of the set in S for which x = a. Similarly,
pr{a < x < b} is the probability of the set of points in S for which
a < x < b. However, this does not work the other way unless the
correspondence between points of S and values of x is 1 to 1. Suppose
there are two points P and Q in S for which x(P) = x(Q) = a; then
pr{x = a} = pr{P + Q}, but this determines neither pr{P} nor
pr{Q}. Our present purpose is to see how a function of the stochastic
variable can be defined so as to determine completely the distribution
function for the event space S. Therefore, for the remainder of
this chapter, we confine ourselves to a discussion of stochastic variables
for which the correspondence between points of S and values of x is
1 to 1.
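The difficulty with a correspondence that is not 1 to 1 can be seen in a very small case. In the Python sketch below, the three-point event space, the variable x, and the two sets of point probabilities are all hypothetical choices of ours; they are arranged so that x takes the same value at P and at Q.

```python
from fractions import Fraction

# Hypothetical three-point event space; x maps both P and Q to the value 1.
x = {"P": 1, "Q": 1, "R": 2}

def pr_x_equals(a, point_probs):
    """pr{x = a}: add the probabilities of the points of S at which x takes the value a."""
    return sum((p for point, p in point_probs.items() if x[point] == a), Fraction(0))

dist_1 = {"P": Fraction(1, 4), "Q": Fraction(1, 4), "R": Fraction(1, 2)}
dist_2 = {"P": Fraction(1, 8), "Q": Fraction(3, 8), "R": Fraction(1, 2)}

# Both distributions on S give the same probabilities for the values of x ...
same_for_x = all(pr_x_equals(a, dist_1) == pr_x_equals(a, dist_2) for a in (1, 2))
# ... although they disagree on the individual points P and Q.
differ_on_points = dist_1 != dist_2
print(same_for_x, differ_on_points)
```

Knowing pr{x = 1} therefore does not settle pr{P} or pr{Q}, which is why the discussion is restricted to 1 to 1 correspondences.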

This restricts the discussion in two ways. First, the requirement
that the points of S be in 1 to 1 correspondence with real
numbers means that S is ... In case it is impossible, ...
the event space as ... stochastic variables ... These, too, have ...
x axis and associate with each point its abscissa. The question of
where we place the origin with respect to the set S is of no great impor- 
tance for the present. Sometimes a strategic placing of the origin will 
simplify computations. More often, the origin is located so that the 
values of the variable x will have a useful physical significance. In 
any case, the usual plan is to decide what values we want x to run 


through and place the set S accordingly. 


10. The Discrete Case 

For the discrete case (a finite set or a sequence) the plan used in the 
dice problem will serve admirably. Let us define a function f(x) such
that, for each value a assumed by x, f(a) = pr{x = a} (this probability
to be determined by hook or crook or Definition 1 from the physical
situation). Then, if E is any set of x values, we set

$$\mathrm{pr}\{E\} = \sum_{x \in E} f(x).$$

It is an easy matter (which we leave to the student) to check that, if
f(x) is defined as we have suggested, the above formula gives a
distribution function satisfying the axioms. Thus, for the discrete case, we
have accomplished the end we had in mind. We have defined a
function f(x) of the stochastic variable x which describes completely the
distribution function for the event space over which x is defined. We
shall call f(x) the probability function for the stochastic variable x. 

If we take any function f(x) and define from it a function of sets by 
adding the values of f(x) over the set in question, the resulting function 
of sets will satisfy the addition axiom C. However, we get A if and
only if we require that $f(x) \ge 0$ for every x; and B is equivalent to the
condition $\sum_{x \in S} f(x) = 1$. Therefore, we might say that f(x) will be
called a probability function if it satisfies these two conditions. Let
us summarize all these remarks as follows:


Theorem II. If f(x) is a function defined over a discrete set S of
values of x and if


(a) $f(x) \ge 0$ for every x in S,
(b) $\sum_{x \in S} f(x) = 1$,


then f(x) is a probability function, and x is a stochastic variable with a
distribution function given by the formula


(c) $\mathrm{pr}\{E\} = \sum_{x \in E} f(x)$.


Conversely, if x is a stochastic variable with a distribution function
pr{E}, then a probability function f(x) is given by the formula


(d) $f(a) = \mathrm{pr}\{x = a\}$ for each a in S.


All this is fairly obvious in the finite case. Even if S is an infinite
sequence of points, the statements in Theorem II are easily checked
provided that all infinite series encountered are convergent. Now, (b)
says that $\sum_S f(x)$ is convergent; but we must deal with sums over
subsets of S as well, i.e., with subseries of $\sum_S f(x)$. A necessary and
sufficient condition that every subseries of $\sum_S f(x)$ be convergent is that
this series itself be absolutely convergent, i.e., that $\sum_S |f(x)|$ be convergent.
Condition (a) tells us that $|f(x)| = f(x)$ for each x; so the convergence
guaranteed by (b) is automatically absolute convergence, and it follows
that all subseries are convergent.

Theorem II suggests a practical way of representing a physical
situation by a mathematical model. In the discrete case, if a probability
function is fully described, the representation is complete. The
significance of this remark is very graphically illustrated by the
observation that in the roulette problem a complete tabulation of the
probability function f(x) would require only 37 entries, while a complete
tabulation of the distribution function pr{E} would require over 137
billion entries.

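The economy of the probability function can be made concrete. The Python sketch below tabulates f(x) for one spin of the roulette wheel, assuming (as our illustration, following Definition 1) that the 37 numbers 0 to 36 are equally likely, and obtains pr{E} for any event by summation as in formula (c) of Theorem II. The function names are our own.

```python
from fractions import Fraction

# Probability function for one spin: 37 entries (numbers 0..36, assumed equally likely).
f = {n: Fraction(1, 37) for n in range(37)}

def pr(event):
    """Formula (c) of Theorem II: sum f(x) over the x values in E."""
    return sum((f[n] for n in event), Fraction(0))

def is_probability_function(g):
    """Conditions (a) and (b) of Theorem II."""
    return all(v >= 0 for v in g.values()) and sum(g.values()) == 1

entries_in_f = len(f)
nonvacuous_events = 2 ** len(f) - 1   # every subset of the 37 points except the empty one

print(is_probability_function(f), entries_in_f, nonvacuous_events)
print(pr(range(1, 37, 2)))            # a bet on the odd numbers
```

The 37-entry table is all that is stored; any one of the 137,438,953,471 event probabilities is computed from it on demand.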
It should be clear from the above discussion how a stochastic variable
representation can be constructed for a given physical situation, but
the student might find it useful to have a systematic description of the
steps involved.


Theorem III (Working Rule). To set up a discrete stochastic variable
representing a given physical situation:


(a) Take as values of x any set of numbers that seems appropriate,
and set up a scheme for associating with each possible result of the
experiment a value of x.
(b) Determine from the physical situation the probability of each of
the results. Definition 1 may be useful in this step.
(c) Set f(x) equal to the probability of the result represented by x.


11. Example—Balls from an Urn 


To illustrate, step by step, the use of this rule, let us consider another 
simple example. An urn contains 5 red balls, 3 white balls, and 7 blue 
balls. One ball is to be drawn at random, and we shall be interested 
in the probability of its being a certain color. Represent this situation 
by a stochastic variable. 


(a) We want to distinguish three possible results of the experiment; 
so we need three values for x. There being no particular reason for
doing otherwise, we shall let these values be 1, 2, and 3 and associate 
them with the three possible results as follows: 


1: A red ball is drawn 
2: A white ball is drawn 
3: A blue ball is drawn 


(b) If we assume that each individual ball is as likely to be drawn as
any other, Definition 1 tells us that the probability of drawing a red
ball is 5/15 or 1/3. Similarly, that of drawing white is 1/5; blue, 7/15.

(c) Putting these results together, we define f(x) by the following
table:

x:      1     2     3
f(x):   1/3   1/5   7/15
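The three steps of the working rule for the urn can be followed in a few lines of Python. This is a sketch only; the dictionary names are ours, and the assignment of the values 1, 2, 3 to red, white, and blue is the one chosen in step (a) above.

```python
from fractions import Fraction

balls = {"red": 5, "white": 3, "blue": 7}
value_of = {"red": 1, "white": 2, "blue": 3}     # step (a): values of x

total = sum(balls.values())
# Steps (b) and (c): each ball is taken as equally likely, so f(x) = count / 15.
f = {value_of[color]: Fraction(count, total) for color, count in balls.items()}

print(f)
print(sum(f.values()))   # condition (b) of Theorem II
```

The fractions are reduced automatically, so 5/15 and 3/15 appear in lowest terms.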


12. The Continuous Case


Turning, now, to the continuous case, suppose, first, that S is the
entire x axis. The recollection that, if f(x) is integrable,

$$\int_a^b f(x)\,dx + \int_b^c f(x)\,dx = \int_a^c f(x)\,dx$$

would suggest that an integral might give us a function of sets satisfying
the addition axiom. Actually, if we think of an integral as it is
introduced in first-year calculus, there are two things wrong with this
suggestion. Such an integral cannot be defined over every measurable
set, whereas a distribution function should be. Furthermore, there is
trouble with the addition principle. This principle holds whenever all
the integrals concerned are defined; but if we allow infinite sums, it may
fail to make sense with probability defined as a Riemann (first-year
calculus) integral. These are serious theoretical difficulties, but in
this course we shall ignore them. This means that we shall set up an
integral, call it a distribution function, and say it satisfies the axioms,
whereas, actually, it does no such thing. Our justification for doing
this is that there is a theory of integration, due largely to the French
mathematician Lebesgue, which overcomes both these difficulties; and
the modern theory of probability is developed in terms of these
integrals.

In order to present a treatment representative of present-day theory
of probability, we shall describe distribution functions as integrals.
On the other hand, in order to stick to concepts familiar to the
beginning student, we shall write all such integrals in the familiar form
$\int_a^b f(x)\,dx$ or perhaps as multiple integrals of the type encountered in
first-year calculus. The general effect of this is that we shall seem to
assume that all events can be represented by intervals on the x axis.
We trust that the student will realize that this is not actually the case;
but we trust that he will benefit by having the mathematical
development put in terms of concepts he understands.

Finally, we should like to reassure the student that, whenever the 
Riemann integral applies, it gives correct results. The Lebesgue 
theory does not introduce different mechanical procedures; it only
covers a wider variety of cases. 


Having decided, then, to set $\mathrm{pr}\{E\} = \int_E f(x)\,dx$, what must we
require of f(x) in order to have axioms A and B satisfied? Clearly, A
holds if and only if $f(x) \ge 0$ for every x; B says $\int_{-\infty}^{\infty} f(x)\,dx = 1$.


We have been operating so far on the assumption that S was the
entire x axis. If we have a problem in which it seems advisable to use
only a part of the line, we can set f(x) = 0 on the part we do not use,
and all formulas will be perfectly correct. Therefore, these other
continuous event spaces need no separate discussion.

So, again, we have a function f(x) of the stochastic variable and an
operation on that function which will determine all the values of the
distribution function ... Theorem II for the discrete case. How ...
That is, given a distribution function, how do we
find the density function? This is practically obvious; the distribution
function is obtained by integrating the density function; so the
density function must be the derivative of the distribution function.
Let us now state our theorem. The student should note the parallel
between this and Theorem II.


Theorem IV. If f(x) is a function defined over the entire x axis and
if


(a) $f(x) \ge 0$ for every x,
(b) $\int_{-\infty}^{\infty} f(x)\,dx = 1$,


then f(x) is a density function, and x is a stochastic variable with a
distribution function given by the formula


(c) $\mathrm{pr}\{E\} = \int_E f(x)\,dx$.


Conversely, if x is a stochastic variable with a distribution function
pr{E} and if $F(t) = \mathrm{pr}\{x \le t\}$ is an indefinite integral, then a density
function f(x) is given by the formula


(d) $f(x) = \dfrac{d}{dx} F(x)$.
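The two directions of Theorem IV can be tried out numerically. In the Python sketch below, the density f(x) = 2x on the interval from 0 to 1 (and 0 elsewhere) is an illustrative choice of ours, not an example from the text, and the integral is approximated by the midpoint rule, so the printed values are approximations only.

```python
def f(x):
    """Illustrative density (an assumption, not from the text): 2x on [0, 1], 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def integrate(g, a, b, n=10_000):
    """Midpoint-rule approximation to the Riemann integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

def pr_interval(a, b):
    """Formula (c) of Theorem IV for an event that is an interval from a to b."""
    return integrate(f, a, b)

def F(t):
    """pr{x <= t}; the density vanishes below 0, so it suffices to integrate from 0."""
    return integrate(f, 0.0, t) if t > 0 else 0.0

total = integrate(f, -1.0, 2.0)                  # condition (b): close to 1
h = 1e-4
slope = (F(0.5 + h) - F(0.5 - h)) / (2 * h)      # formula (d): close to f(0.5)
print(total, pr_interval(0.25, 0.5), slope)
```

The difference quotient of F at 0.5 recovers the density there, which is the content of formula (d).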


Apropos the remark we made to the effect that probability zero 
does not necessarily mean logical impossibility, let us note that 
$\int_a^a f(x)\,dx = 0$; thus in the continuous case the probability of a single
point is zero. Furthermore, the sum of even an infinite sequence of
zeros is still zero; therefore in the continuous case the probability of
every discrete set is zero. From this it follows that, if we talk about
the probability that x is in a certain interval, it makes no difference
whether or not we include the end points as part of the interval; that
is, $\mathrm{pr}\{a < x < b\} = \mathrm{pr}\{a \le x \le b\}$.

There still remains the problem of finding a density function to go 
with a given physical situation. Definition 1 is of no use whatsoever 
here, because it is limited by its very nature to the discrete case.
However, many continuous case problems are based on an idea very similar
to that of equally likely results. This is the idea of choosing a point 
at random on a line interval. Just as with equally likely results, it is 
useless to try to define this phrase in physical terms. The best we can 
do is state the continuous case parallel to Definition 1. 


Definition 3. If a point is chosen at random on a line interval of
length A, the probability that it is in a given subinterval of length a
is a/A.


It follows immediately from this that for a point chosen at random
on the interval from 0 to A, the density function is equal to 0 outside
the interval and 1/A inside. For, if $0 < t < A$, by Definition 3,

$$\mathrm{pr}\{x \le t\} = \mathrm{pr}\{0 \le x \le t\} = \frac{t}{A};$$

thus the distribution function is x/A, and its derivative is 1/A. This
disposes of a simple, yet important, special case; but it does not give us
a general procedure for setting up the stochastic variable representations
in the continuous case.

One such general procedure is suggested by the following observation:

$$\mathrm{pr}\{t < x < t + dt\} = \int_t^{t+dt} f(x)\,dx,$$

and this integral is approximately equal to $f(t)\,dt$, in the sense that
the error is an infinitesimal of higher order than dt.
So, if the physical setup led us to an approximation to the probability
that $t < x < t + dt$, and if we expressed our approximation in the
form of a function of t multiplied by dt, then we might
expect that this function would give us the proper form for the density
function.
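The phrase "an infinitesimal of higher order than dt" can be made tangible with a small computation. The Python sketch below uses an illustrative density of our own choosing, f(x) = 2x on the interval from 0 to 1, for which the exact probability is available from F(x) = x squared; it compares the exact value of pr{t < x < t + dt} with the approximation f(t) dt.

```python
def exact_probability(t, dt):
    """pr{t < x < t + dt} for the illustrative density f(x) = 2x on [0, 1], where F(x) = x**2."""
    return (t + dt) ** 2 - t ** 2

def approximation(t, dt):
    """The first-order form f(t) dt, with f(t) = 2t."""
    return 2 * t * dt

t = 0.3
for dt in (1e-1, 1e-2, 1e-3):
    error = exact_probability(t, dt) - approximation(t, dt)
    # error / dt shrinks along with dt: the error is of higher order than dt.
    print(dt, error, error / dt)
```

Here the error is exactly dt squared, so the ratio of error to dt is dt itself and tends to zero.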

If we intend to use this suggestion, we should give some proof that
it gives a function f(x) that satisfies the conditions (a) and
(b) of Theorem IV and that the application of Theorem IV yields
probabilities consistent with those assumed in setting up the function.
This is the type of proof we have decided to omit. For the benefit of
those familiar with a rigorous definition of the Riemann integral, we
might add that Duhamel’s theorem will do the job without much
trouble. [See Franklin, Treatise on Advanced Calculus, New York
(1940), pages 266 to 269.]

Another procedure that frequently works is that used above in the
random point on a line problem. If we can dig up any information
from the physical situation that will enable us to express $\mathrm{pr}\{x \le t\}$ as a
function F(t), then [by (d) of Theorem IV] the derivative of F gives us
the density function. Again, let us give a formal summing up.


Theorem V (Working Rule). In the continuous case, to represent a
physical situation by a stochastic variable:


(a) Represent the set of all possible results in some orderly fashion
by some set of values for x. The set of x values used may or may not
be the whole real number system.

(b) Dispose of any unused x values immediately by setting f(x) = 0
for these values.


Then, proceed to (c) and (d) or to (c') and (d').


(c) For t and t + dt both lying in the set of x values that represent
results, figure out from the physical situation an approximation to the
probability that t < x < t + dt. Express the result in the form
f(t)dt.

(d) Take the function f(t) from (c); replace the t by x; use the
resulting function f(x) as a density function.

(c') For t in the set of x values that represent results, figure out from
the physical situation the probability that $x \le t$. Express the result
in the form F(t).

(d') Take the function F(t) from (c'); replace the t by x; differentiate
with respect to x. This derivative will be the density function.


With regard to the word “approximation” in (c), it should be added 
that it must be an approximation for which the error is an infinitesimal 
of higher order than dt. Many first-year calculus books give the
subject of order of infinitesimals only the “once over lightly” treatment;
so while this is a precise description of what is required, it might be of
more interest to the student to note that the choice of this
approximation is governed by the same rules as those which govern the choice of
approximations to increments of area, volume, work, etc., in setting up
definite integrals. In fact, steps (c) and (d) are just another first-year 
calculus exercise in setting up definite integrals. 


13. Example—Bombardment of Hemispherical Screen 


To illustrate step by step the use of Theorem V, let us consider the
following problem: A hemispherical screen is bombarded by a stream
of electrons (see sketch). It is assumed that the radial distribution of
electrons is uniform and that all electrons considered hit the screen.
Set up a density function that will determine probabilities as to the
colatitude $\varphi$ of the point at which a given electron, chosen at random,
will hit.

(a) For obvious reasons, we shall call the stochastic variable
$\varphi$ in this case. Results will then be represented by the values of $\varphi$
from 0 to $\pi/2$.

(b) According to instructions, we set $f(\varphi) = 0$ for
$\varphi < 0$ and for $\varphi > \pi/2$.

(c) The event $t < \varphi < t + dt$ is equivalent
(see sketch) to the event that the radial distance falls in an interval of
length $r$; therefore its probability is $r/R$. Now, $r = s \cos t$, and $s$ is
approximately $R\,dt$; hence $r/R \approx \cos t\,dt$.

(d) Replacing $t$ by $\varphi$, we have the density function $f(\varphi) = \cos \varphi$.
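The density cos φ can be compared with a simulation. The Python sketch below rests on our reading of the problem: the distance of an electron from the axis of the stream is taken as uniformly distributed from 0 to R, and the electron lands at the colatitude whose radial distance is R sin φ. The function names, the seed, and the sample size are arbitrary choices of ours.

```python
import math
import random

def simulate_colatitudes(n, R=1.0, seed=0):
    """Sample colatitudes, assuming the distance rho from the axis is uniform on [0, R]
    and the electron lands where rho = R * sin(phi)."""
    rng = random.Random(seed)
    return [math.asin(rng.uniform(0.0, R) / R) for _ in range(n)]

def pr_interval(a, b):
    """Integral of the density cos(phi) over [a, b], valid for 0 <= a <= b <= pi/2."""
    return math.sin(b) - math.sin(a)

samples = simulate_colatitudes(200_000)
a, b = math.pi / 6, math.pi / 3
observed = sum(a < phi < b for phi in samples) / len(samples)
print(observed, pr_interval(a, b))
print(pr_interval(0.0, math.pi / 2))   # total probability over the whole screen
```

The observed fraction should fall close to the value obtained by integrating the density, up to sampling error.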


14. Example—The Radium Atom 

To illustrate the primed version of Theorem V, we might try the problem
of the radium atom. It has been verified experimentally that the time
rate of decomposition of a quantity of radium is proportional to the
mass. That is,

$$\frac{dm}{dt} = -km.$$

Furthermore, it is known that the mass is halved in 1,580 years. Thus
if we have mass $m_0$ at t = 0, we have mass $m_0/2$ at t = 1,580. Armed
with this information, we can find m in terms of t:

$$\frac{dm}{m} = -k\,dt.$$

Integrating, we get

$$\log m = -kt + C.$$

From the condition $m = m_0$ at t = 0, we have

$$C = \log m_0.$$

From the condition $m = m_0/2$ at t = 1,580, we have

$$k = \frac{\log 2}{1{,}580}.$$

Thus

$$m = m_0 e^{-(\log 2/1{,}580)t},$$

and the fraction of the original mass that disintegrates from time 0 to
time t is

$$\frac{m_0 - m}{m_0} = 1 - e^{-(\log 2/1{,}580)t}.$$

Now, if this is what happens in the aggregate, we might take it as a
reasonable indication of the probability that a given atom will
disintegrate before time t.

So, to follow through the steps in Theorem V, suppose we are given
an atom of radium at time 0. (a) Let x stand for the time at which it
disintegrates. (b) We want only the positive half of the x axis; so we set
f(x) = 0 for x < 0. (c') On the basis of the argument above, we agree
that

$$\mathrm{pr}\{x \le t\} = 1 - e^{-(\log 2/1{,}580)t}.$$

(d') Substituting and differentiating, we get the density function

$$f(x) = \frac{d}{dx}\left(1 - e^{-kx}\right) = ke^{-kx},$$

where k = (log 2)/1,580.
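The steps (c') and (d') for the radium atom can be checked numerically. The Python sketch below uses the half-life of 1,580 years given in the text; the function names and the point at which the derivative is tested are our own choices.

```python
import math

HALF_LIFE = 1580.0                 # years, the figure used in the text
k = math.log(2) / HALF_LIFE

def F(t):
    """pr{x <= t}: probability that the atom disintegrates before time t."""
    return 1.0 - math.exp(-k * t) if t > 0 else 0.0

def f(x):
    """Density obtained by differentiating F."""
    return k * math.exp(-k * x) if x > 0 else 0.0

h = 1e-3
numeric_derivative = (F(1000.0 + h) - F(1000.0 - h)) / (2 * h)
print(F(HALF_LIFE))                       # probability of decay within one half-life
print(numeric_derivative, f(1000.0))      # difference quotient of F against the density
print(F(2 * HALF_LIFE) - F(HALF_LIFE))    # probability of decay during the second half-life
```

The probability of disintegration within 1,580 years comes out as one-half, as the physical assumption requires.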


15. Synthesis of Discrete and Continuous Cases 


In succeeding chapters we shall prove a number of theorems about 
probabilities and related quantities. Many of these proofs will be 
based on the analytical representation of mathematical probability. 
Unfortunately, however, we have two such representations, (c) of 
Theorem II and (c) of Theorem IV. There is no way of consolidating 
the two for purposes of making direct computations. If we want to 
find a specific probability from a stochastic variable representation, in 
the discrete case we must add. In the continuous case we must 
integrate. However, there is a thing called a Stieltjes integral which 
has as special cases ordinary integrals, finite sums, and infinite series. 
So the two representations could be consolidated into a single formula 
involving a Stieltjes integral. This procedure would serve to unify 
subsequent theoretical discussions but would have no effect whatever 
on the nature of the computations involved in a specific problem. 

We shall not attempt to use Stieltjes integrals in this course because 
we want to stick to analytic forms that are familiar to the first-year 
calculus student; so everything we discuss will come in two cases. 
Rather than give two proofs of each theorem, we shall usually present 
the theory for the continuous case only. Of course, many of the exam- 
ples and exercises will fit the discrete case. Therefore, for everything 
we do with integrals, the student might do well to run through an 
analogous operation with sums. We have chosen to present our
discussion in terms of integrals for two reasons: first, many of the
fundamental relationships stand out more clearly when presented in this
form; second, we want to accustom the student to thinking of
probability as an integral because it appears in that form in advanced
treatises on the subject.

One thing we lose by not using Stieltjes integrals is the consideration
of the so-called mixed case. Suppose $f(x) \ge 0$ and
$\int_{-\infty}^{\infty} f(x)\,dx = 1/2$. Then, let us take some point
$x_0$ and set $\mathrm{pr}\{x = x_0\} = 1/2$. Now, let us define
$\mathrm{pr}\{E\}$ as $\int_E f(x)\,dx$ provided that $E$ does not
contain $x_0$, and let $\mathrm{pr}\{E\} = 1/2 + \int_E f(x)\,dx$ if $E$
contains $x_0$. This distribution function clearly satisfies the axioms,
but it is not described by either a probability function or a density
function. Having mentioned this possibility, we shall now proceed to
ignore it. The only satisfactory description of all such cases (they can
be much more complicated than the above example) is a Lebesgue-Stieltjes
integral. So here is one more way in which our treatment of probability
theory, though representative in at least a formal sense, is definitely
restricted in the interests of mathematical simplicity.
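
The mixed case can be made concrete with a small sketch. The particular choices below are assumptions made only for illustration and do not come from the text: the density is taken as f(x) = 1/2 on the interval from 0 to 1 (so that its integral is 1/2), the isolated point is x_0 = 3/10, and the helper name `pr_interval` is hypothetical. Only closed intervals are handled.

```python
from fractions import Fraction

# Assumed for illustration: f(x) = 1/2 on [0, 1], zero elsewhere,
# plus a point mass of 1/2 at X0.
X0 = Fraction(3, 10)
ATOM = Fraction(1, 2)


def pr_interval(a, b):
    """Probability of the closed interval [a, b] under the mixed distribution."""
    a, b = Fraction(a), Fraction(b)
    lo, hi = max(a, Fraction(0)), min(b, Fraction(1))
    # Continuous part: integral of f over the overlap of [a, b] with [0, 1].
    continuous = Fraction(1, 2) * max(hi - lo, Fraction(0))
    # Discrete part: the atom contributes only if the interval contains X0.
    atom = ATOM if a <= X0 <= b else Fraction(0)
    return continuous + atom


whole = pr_interval(0, 1)                                # entire event space
without_atom = pr_interval(0, Fraction(1, 5))            # interval missing X0
with_atom = pr_interval(Fraction(1, 5), Fraction(2, 5))  # interval containing X0
```

The whole space receives probability 1, while an interval of length 1/5 receives either 1/10 or 1/10 + 1/2 according to whether it contains x_0. Neither a probability function alone nor a density function alone reproduces that behavior.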

For the benefit of the student who is interested in pursuing this
subject further, we might add that the chapters of Cramér's book listed
below contain an excellent discussion of the Lebesgue-Stieltjes integral
designed specifically as an introduction to probability theory.


REFERENCES FOR FURTHER STUDY


Coolidge, An Introduction to Mathematical Probability, Oxford (1925),
Chap. I.

Cramér, Mathematical Methods of Statistics, Princeton (1946), Chaps.
1-7, 13, 14.

Kolmogoroff, Foundations of the Theory of Probability, New York (1950).

Levy and Roth, Elements of Probability, Chaps. I, II, IV, VI.

Struik, "On the Foundations of the Theory of Probability," Philosophy
of Science (1934), vol. 1, pp. 50-70.

Uspensky, Introduction to Mathematical Probability, New York (1937),
Introduction, Chaps. I, II, XII, XIII.


PROBLEMS

1. What is the probability of throwing 8, 9, or 10 with 2 dice?
Ans. 1/3.
2. What is the probability of throwing 6 or less with 3 dice?
Ans. 5/54.
3. A set of pool balls (15 balls numbered 1 to 15) is placed in an urn,
and 2 balls are drawn simultaneously. What is the probability that
the sum of their numbers is 10? Ans. 4/105.
4. If, in Prob. 3, one ball is drawn and replaced and then the second
is drawn, what is the probability that the sum of the numbers is 10?
Ans. 1/25.
5. In Prob. 3, what is the probability that the product of the numbers
is 10? What is the probability of this event in Prob. 4?
Ans. 2/105, 4/225.
6. A roulette player places 3 bets on one spin of the wheel: 1 on 
rows 4 and 5 (numbers 10 to 15), 1 on the first column, and 1 on red. 
What is the probability that at least 1 of these will win? that the first 2 
mentioned will win? that all 3 will win? Ans. 26/37, 2/37, 0.
7. Four tickets numbered 1, 2, 3, 4 are placed in an urn and drawn
out one at a time (without replacements). They are renumbered in
the order in which they are drawn. What is the probability that all 4
numbers are changed? Ans. 3/8.
8. Find the probability of being dealt each of the following poker
hands: 1 pair, 2 pairs, 3 of a kind, full house, 4 of a kind.
Ans. 352/833; 198/4,165; 88/4,165; 6/4,165; 1/4,165.
9. A poker hand contains 4 cards of one suit and 1 odd card. If the
odd card is discarded and 1 card drawn from the remainder of the deck,
what is the probability of filling the flush (getting a fifth card of the
given suit)? Ans. 9/47.
10. A poker hand contains 4 cards in sequence (not A234 or JQKA)
and 1 odd card. If the odd card is discarded and 1 card drawn from
the remainder of the deck, what is the probability of filling the straight?
(A straight is a 5-card sequence, not necessarily all of the same suit.)
Ans. 8/47.
11. What is the probability of filling an "inside" straight? The
hand contains 4 of a 5-card sequence with a gap in the middle (for
example, 4578); the odd card is discarded and 1 card drawn from the
remainder of the deck. Ans. 4/47.
12. A poker hand contains a 4-card sequence (as in Prob. 10) all in
the same suit. If 1 card is drawn, what is the probability of filling
either the straight or the flush? Ans. 15/47.
13. Let x represent the number of aces in a single 13-card bridge
hand. Find the probability function for the stochastic variable x.
Ans. $f(0) = {}^{48}C_{13}/{}^{52}C_{13}$, $f(1) = 4\cdot{}^{48}C_{12}/{}^{52}C_{13}$,
$f(2) = 6\cdot{}^{48}C_{11}/{}^{52}C_{13}$, $f(3) = 4\cdot{}^{48}C_{10}/{}^{52}C_{13}$,
$f(4) = {}^{48}C_{9}/{}^{52}C_{13}$.
14. What is the probability that a bridge hand will contain no more
than 1 ace? Ans. $({}^{48}C_{13} + 4\cdot{}^{48}C_{12})/{}^{52}C_{13}$.




15. Two urns contain, respectively, 3 white, 5 red, 12 black balls
and 8 white, 3 red, 9 black balls. One ball is drawn from each urn.
Represent the 9 (ordered) color combinations by a stochastic variable,
and find its probability function.

Ans.  x:     1       2      3       4       5       6       7       8       9
      f(x):  24/400  9/400  27/400  40/400  15/400  45/400  96/400  36/400  108/400

Note: The student should indicate the significance of each value of x.

16. From the results in Prob. 15, find the probability that neither
ball was black; that they were not both black. Ans. 11/50, 73/100.

17. Three balls numbered 1, 2, 3 are placed in an urn. A ball is
drawn and replaced; then another ball is drawn. Let the stochastic
variable x represent the sum of the 2 numbers drawn. Find the
probability function for x.

Ans.  x:     2    3    4    5    6
      f(x):  1/9  2/9  1/3  2/9  1/9

18. Suppose the second drawing is made without replacing the first
ball. Find the probability function for the sum of the numbers drawn.

Ans.  x:     3    4    5
      f(x):  1/3  1/3  1/3

19. From this same urn the balls are drawn as in Prob. 17. Find the
probability function for the product of the numbers drawn.

Ans.  x:     1    2    3    4    6    9
      f(x):  1/9  2/9  2/9  1/9  2/9  1/9

20. Find the probability function for the product of the numbers
drawn when the drawings are made as in Prob. 18.

Ans.  x:     2    3    6
      f(x):  1/3  1/3  1/3

21. An urn contains 10 white balls and 10 black balls. Five balls
are drawn simultaneously. Find the probability function for the
number of white balls drawn.

Ans.  x:     0           1             2             3             4             5
      f(x):  252/15,504  2,100/15,504  5,400/15,504  5,400/15,504  2,100/15,504  252/15,504


6 


Ans. Gi 


í 

23. The stochastic variable x can assume the values 0, 1, 2, 3, . . .
with pr{x = r} proportional to $a^r/r!$. Find the probability function
for x. Ans. $f(x) = a^x e^{-a}/x!$ (Poisson's distribution).
ariable x can assume the values 1, 2, + ' poy a l 
ional tor. Find the probability functioP — j); | 
Ans. f(x) = Que /n(v 

7 

d 


24. The stochastic y 
with pr{x = r} proport 




25. A radio station broadcasts the correct time every hour on the 
hour. What is the probability that a listener tuning in at random will 
have to wait less than 10 minutes to get the correct time? Ans. 1/6.

26. A point is chosen at random on a line segment, dividing it into 2 
segments. Find the probability that the ratio of the length of the 
left-hand segment to that of the right-hand one is less than a given 
number a. Ans. a/(1 + a). 

27. In Prob. 26, what is the probability that the ratio of the length
of the right-hand segment to that of the left-hand one is less than a?
Ans. a/(1 + a).

28. In Prob. 26, what is the probability that the ratio of the length of the shorter
segment to that of the longer one is less than 14? Ans. 1⁄5. 

29. Given an atom of radium at the beginning of a leap year, find the
probability that it will disintegrate during a leap year. (Use a Julian
calendar: every fourth year a leap year.)

Ans. $(1 - e^{-k})/(1 - e^{-4k})$; $k = (\log 2)/1{,}580$.

30. A point is chosen at random on a semicircle and projected onto
the diameter. Find the density function for the point of projection.

Ans. $f(x) = 1/(\pi\sqrt{1 - x^2})$ for $-1 < x < 1$, $f(x) = 0$ otherwise.

31. An angle $\theta$ is chosen at random between $-\pi/2$ and $\pi/2$, and a line
is drawn through the point (0,1) at the angle $\theta$ with the y axis. Find
the density function for the point x at which this line crosses the
x axis. Ans. $f(x) = 1/(\pi(1 + x^2))$ (Cauchy's distribution).

32. A number is chosen at random between 0 and 1. What is the
probability that its first decimal place is a 7? that its second decimal
place is a 7? that the nth place is a 7? Ans. 1/10 in each case.

33. A number is chosen at random between 0 and 1. What is the
probability that its first 2 decimal places are 7's? that any 2 specified
decimal places are 7's? Ans. 1/100 in each case.

34. A number is chosen at random between 0 and 1. What is the
probability that it has a 7 in each of k specified decimal places?

Ans. $1/10^k$.
Note: Problems 33 and 34 should be solved directly from Definition
3. See comment, Prob. 33, Chap. 3.

35. A number is chosen at random between 0 and 1. Let $x_n$ be the
nth decimal place. Find the probability function for $x_n$.

Ans. $f(x_n) = 1/10$; $x_n = 0, 1, 2, \ldots, 9$.

36. Prove Theorem I, using only the axioms. Outline: (a) follows 
from Axioms B and C; (b) follows from (a) and Axiom B; (c) follows 
from (a) and Axiom A; (d) follows from Axioms A and C; (e) may be 
proved by induction, using Axiom C and (d). 


CHAPTER 3 
JOINT DISTRIBUTIONS 


In the previous chapter we discussed the representation of a situation
by a single stochastic variable. Such a representation amounts to
lining the possible events up in a row and assigning abscissas to them
(and then, of course, computing a suitable density function). However,
there are situations in which such a linear array of the possible events
does not adequately describe all the similarities and interrelationships
among them. For instance, in the throwing of 2 dice, we might like
to list as the set of possible events the set of all ordered pairs of numbers
that might show up. These could be laid out in a row and numbered
in some orderly fashion, but such a set of events is just crying to be
arranged in a 6 X 6 square array. Perhaps this point is even better
illustrated by Prob. 15, Chap. 2. There each event consists of a pair
of colors (9 such pairs in all). These, too, could be laid out in a row
and numbered (and that is what we suggested in the answer we gave),
but is it not much more natural to arrange these pairs in a 3 X 3 square
array? If we had 3 urns containing 3 colors each, we should be led in
the same manner to a 3 X 3 X 3 cubical arrangement.

Granted, then, that there are situations for which sets of points in
two or more dimensions give better representations than linear sets,
we shall take the hint and develop a theory of two or more stochastic
variables. One thing we might note to begin with is that, when we use
a multidimensional representation of a physical situation, we describe
each basic event (point) in the space as a logical product of two or more
events. That is, if A is the event $x = x_0$ and B is the event $y = y_0$,
then the point $(x_0, y_0)$ is the event AB. We shall frequently refer to
such product events as compound events. The individual events of
which a compound event is the product we shall call its component
events. We might add that "compoundness" is not an intrinsic property
of physical events. We use the phrase only to distinguish certain
events in a physical situation for which a multidimensional event space
seems to be the natural representation.


16. Joint Density Functions

In the interest of making ideas clearer by keeping the notation
simpler, we shall confine most of our discussion in this chapter to the
two-dimensional case, i.e., the case in which each compound event has
two components. The student who conquers this two-dimensional
discussion should have no trouble in following through all the steps for
as many variables as he pleases.

Suppose the event space S is a two-dimensional set. Events will be
measurable subsets of S, and the axioms will be expected to apply
verbatim. To get a stochastic variable representation, let x and y be
rectangular coordinates in the plane in which S lies. Then, x and y
are functions of the points of S, so each is a stochastic variable
(Definition 2, Chap. 2). To describe the distribution function $\mathrm{pr}\{E\}$ in
terms of a function $\varphi(x,y)$, let us note that a double integral of $\varphi$
satisfies the addition principle. Therefore, all we need is a pair of
conditions on $\varphi(x,y)$ to guarantee the satisfaction of Axioms A and B.
Obviously, these are $\varphi(x,y) \ge 0$ and
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \varphi(x,y)\,dx\,dy = 1.$$
Such a function $\varphi(x,y)$ we shall call a joint density function. The
distribution function $\mathrm{pr}\{E\}$ that it generates we shall call the joint
distribution function for x and y.

This gives us the first half of a theorem similar to Theorem IV,
Chap. 2. The second half [a formula for the joint density function
$\varphi(x,y)$ in terms of the distribution function $\mathrm{pr}\{E\}$] is obtained by
noting that the inverse operator to a double integral is a cross partial
derivative. That is, if
$$F(u,v) = \int_{-\infty}^{v}\int_{-\infty}^{u} \varphi(x,y)\,dx\,dy,$$
then
$$\varphi(u,v) = \frac{\partial^2 F}{\partial u\,\partial v}.$$

One more remark is called for before we give a formal statement of
these results. We can always speak of $\varphi(x,y)$ as being defined over the
entire xy plane. If S is not the whole plane, we set $\varphi(x,y) = 0$
outside S.


Theorem I. If $\varphi(x,y)$ is a function defined over the entire xy plane
and if

(a) $\varphi(x,y) \ge 0$ for every x, y,
(b) $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \varphi(x,y)\,dx\,dy = 1$,

then $\varphi(x,y)$ is a joint density function, and x and y are stochastic
variables with a joint distribution function given by the formula

(c) $\mathrm{pr}\{E\} = \iint_E \varphi(x,y)\,dx\,dy$.

In particular,

(d) $\mathrm{pr}\{a < x < b \text{ and } c < y < d\} = \int_c^d\int_a^b \varphi(x,y)\,dx\,dy$.

Conversely, if x and y are stochastic variables with a joint distribution
function $\mathrm{pr}\{E\}$ and if $F(u,v) = \mathrm{pr}\{x < u \text{ and } y < v\}$ is an indefinite
double integral, then a joint density function $\varphi(x,y)$ is given by the
formula

(e) $\varphi(u,v) = \dfrac{\partial^2 F}{\partial u\,\partial v}$.


Many times, in what follows, we shall have occasion to reverse the
order of integration in a double integral. We shall do this without
calling any particular attention to it. A word should be said here
about the justification for this procedure, particularly since we are so
often dealing with improper integrals. The standard theorem on this
is that the order of integration may be reversed if the integral is
absolutely convergent (see Franklin, Treatise on Advanced Calculus, page
398). Now (a) of Theorem I means that any convergence of integrals
of $\varphi$ is absolute convergence, and (b) says (among other things) that
all integrals of $\varphi$ are convergent. So, as long as $\varphi$ is our only integrand,
we may change the order of integration at will. Later on (Chap. 6) we
shall have other integrands to deal with, but we shall make our
definitions in such a way that absolute convergence is still guaranteed.

As we noted at the end of Chap. 2, we shall develop the theory in
terms of the continuous case. However, we might do well to state the
discrete case parallel to Theorem I.


Theorem II. If $\varphi(x,y)$ is a function defined over a discrete set S
of points in the xy plane and if

(a) $\varphi(x,y) \ge 0$ for every point (x,y) in S,
(b) $\sum_{(x,y) \text{ in } S} \varphi(x,y) = 1$,

then $\varphi(x,y)$ is a joint probability function, and x and y are stochastic
variables with a joint distribution function given by the formula

(c) $\mathrm{pr}\{E\} = \sum_{(x,y) \text{ in } E} \varphi(x,y)$.

In particular,

(d) $\mathrm{pr}\{x = x_0 \text{ and } y = y_0\} = \varphi(x_0,y_0)$.

Conversely, if x and y are discrete stochastic variables with a joint
distribution function $\mathrm{pr}\{E\}$, then a joint probability function $\varphi(x,y)$ is
given by (d).


17. Marginal Distributions 

We now raise the question whether stochastic variables described by
a joint density function have their own density functions as in Theorem
IV, Chap. 2. This is, indeed, the case; and these individual density
functions are determined by the joint density function. Note that
the single pair of inequalities a < x < b describes an entire vertical
strip in the xy plane; so
$$\mathrm{pr}\{a < x < b\} = \int_{-\infty}^{\infty}\int_a^b \varphi(x,y)\,dx\,dy = \int_a^b\int_{-\infty}^{\infty} \varphi(x,y)\,dy\,dx = \int_a^b f(x)\,dx,$$
where
$$f(x) = \int_{-\infty}^{\infty} \varphi(x,y)\,dy.$$

It is now easily seen that f(x) will serve as a density function for the
stochastic variable x. (Set a = t, b = t + dt, and apply Theorem V,
Chap. 2.) Similarly, we may obtain the density function
$$g(y) = \int_{-\infty}^{\infty} \varphi(x,y)\,dx.$$

These individual density functions f(x) and g(y) are called the
marginal density functions, and the distribution functions they generate
by means of (c) of Theorem IV, Chap. 2, are called the marginal
distributions. This terminology probably comes from the fact that in the
finite case a convenient way of describing $\varphi(x,y)$ is to write its values
out in a square array. Then, f(x) and g(y) are given by adding the
columns and rows, respectively; and the values of these functions are
very conveniently listed around the margins of the square array.

It should be emphasized that any discussion of two stochastic variables
is based primarily on the joint density function and not on the
marginal density functions. While we have just seen that the joint
density function determines the marginal ones, the reverse is not true
at all. It is very easy to find two fundamentally different joint density
functions each of which gives rise to the same pair of marginal density
functions. Consider, for instance,
$$\varphi_1(x,y) = \begin{cases} x + y & \text{for } 0 \le x \le 1 \text{ and } 0 \le y \le 1,\\ 0 & \text{otherwise;}\end{cases}$$
$$\varphi_2(x,y) = \begin{cases} (\tfrac12 + x)(\tfrac12 + y) & \text{for } 0 \le x \le 1 \text{ and } 0 \le y \le 1,\\ 0 & \text{otherwise.}\end{cases}$$


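The equation $(\tfrac12 + x)(\tfrac12 + y) = x + y$ has for its only solutions $x = \tfrac12$ and
$y = \tfrac12$; thus, inside the unit square, $\varphi_1(x,y) = \varphi_2(x,y)$ only along these two
lines. However, each will pass as a joint density function, and each
leads to the same pair of marginal density functions:
$$f_1(x) = \int_{-\infty}^{\infty} \varphi_1(x,y)\,dy = \int_0^1 (x + y)\,dy = \tfrac12 + x;$$
$$f_2(x) = \int_{-\infty}^{\infty} \varphi_2(x,y)\,dy = \int_0^1 (\tfrac12 + x)(\tfrac12 + y)\,dy = \tfrac12 + x;$$
$$g_1(y) = \int_{-\infty}^{\infty} \varphi_1(x,y)\,dx = \int_0^1 (x + y)\,dx = \tfrac12 + y;$$
$$g_2(y) = \int_{-\infty}^{\infty} \varphi_2(x,y)\,dx = \int_0^1 (\tfrac12 + x)(\tfrac12 + y)\,dx = \tfrac12 + y.$$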
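
The student who wishes to check this example numerically might try something like the following sketch. It approximates the marginal integrals by a midpoint rule (a choice made here for convenience, not a method from the text); the function names and the sample points 0.3, 0.7, and (0.2, 0.2) are likewise assumptions made only for illustration.

```python
def phi1(x, y):
    """First joint density: x + y on the unit square, 0 elsewhere."""
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0


def phi2(x, y):
    """Second joint density: (1/2 + x)(1/2 + y) on the unit square, 0 elsewhere."""
    return (0.5 + x) * (0.5 + y) if 0 <= x <= 1 and 0 <= y <= 1 else 0.0


def marginal_x(phi, x, n=1000):
    """Midpoint-rule approximation to f(x), the integral of phi(x, y) over y in [0, 1]."""
    h = 1.0 / n
    return sum(phi(x, (j + 0.5) * h) for j in range(n)) * h


def marginal_y(phi, y, n=1000):
    """Midpoint-rule approximation to g(y), the integral of phi(x, y) over x in [0, 1]."""
    h = 1.0 / n
    return sum(phi((i + 0.5) * h, y) for i in range(n)) * h


f1, f2 = marginal_x(phi1, 0.3), marginal_x(phi2, 0.3)   # both should be near 1/2 + 0.3
g1, g2 = marginal_y(phi1, 0.7), marginal_y(phi2, 0.7)   # both should be near 1/2 + 0.7
joint_gap = phi2(0.2, 0.2) - phi1(0.2, 0.2)             # the joint densities differ here
```

The marginals agree to within rounding error, while the two joint densities differ at the sample point, which is exactly the situation described above.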


18. Example—Two Dice 


As for representing physical situations by pairs of stochastic variables,
the procedure is roughly the same as that outlined in Chap. 2 for
getting one-variable representations. The general principle is that, if
we know the probabilities of all the events involved, the density function
is arrived at in the usual manner. The only real question then
concerns finding the probability of a compound event. Now, in many
cases this can be done by inspection or by a direct application of
Definition 1, Chap. 2, to the compound events themselves. The problem
of the two dice is one that can be done by inspection. Let us look
at it by way of illustration of the ideas presented so far.

We think of each result of throwing 2 dice as a compound event composed
of 1 result on one die and 1 on the other. Now, let x denote
the result on one die and y the result on the other. There are 6 possible
results on each die; so we give x the values 1, 2, . . . , 6 and do the
same for y. This gives us 36 points in the xy plane, each of which
represents 1 of our compound events. Now (and this is what we
meant by doing it by inspection), these compound events seem equally
likely; so we give them each a weight of 1/36. That is, we set
$\varphi(x,y) = 1/36$ for each of the 36 points at which it is to be defined.

In this, as in any other finite case, an integral from $-\infty$ to $\infty$ is
replaced by a finite sum (for this particular situation, a sum of 6 terms).
A double integral over an area will consist of the sum of all $\varphi$ values at
points found in that area. Bearing these points in mind, we might
carry out a few of the operations on joint density functions already
discussed. The accompanying chart shows the values of $\varphi(x,y)$ in
appropriate places in the plane. The rows and columns are added to
give the marginal probability functions. Finally, the values of x and
y are listed around the outside.


[Fig. 4. Chart of the 36 points (x, y), with x, y = 1, 2, . . . , 6, each
carrying $\varphi(x,y) = 1/36$; the marginal values f(x) = 1/6 and g(y) = 1/6
are listed along the margins.]

$$\mathrm{pr}\{3 \le x \le 4 \text{ and } 4 \le y \le 5\} = \sum_{y=4}^{5}\sum_{x=3}^{4} \varphi(x,y) = 4/36 = 1/9$$
(area outlined by dotted line). However, the house pays off on the
sum of x and y; hence the events we are most interested in are those
represented by the areas between diagonal lines. To compute these,
the formality of a double sum is unduly laborious; so we dispense with
it and merely add down the diagonal strips to get the usual table:

x + y:  2     3     4     5     6     7     8     9     10    11    12
pr:     1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

It appears from this that the stochastic variable we used to describe
the two dice in Chap. 2 is the sum of the two variables we are using
here. This is of more than just passing interest. In later chapters we
shall have quite a lot to say about sums of stochastic variables, and the
student will want to know what is meant by such a sum and what
significance can be attached to it. These questions are discussed at
some length in Chap. 5, but the example given here might serve as a
starting point for the student's thinking on the subject.
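
The operations carried out on the two dice can be repeated mechanically. The following is a minimal sketch using exact fractions; the variable names (`phi`, `f`, `g`, `rect`, `sum_dist`) are chosen here for illustration and are not notation from the text beyond the obvious correspondence.

```python
from collections import defaultdict
from fractions import Fraction

faces = range(1, 7)

# Joint probability function: weight 1/36 at each of the 36 points.
phi = {(x, y): Fraction(1, 36) for x in faces for y in faces}

# Marginal probability functions: add over the other variable.
f = {x: sum(phi[x, y] for y in faces) for x in faces}
g = {y: sum(phi[x, y] for x in faces) for y in faces}

# A "double integral" over a rectangle is a finite double sum.
rect = sum(phi[x, y] for x in (3, 4) for y in (4, 5))

# Adding down the diagonal strips gives the distribution of x + y.
sum_dist = defaultdict(Fraction)
for (x, y), p in phi.items():
    sum_dist[x + y] += p
```

Here `rect` reproduces the probability of the dotted rectangle, and `sum_dist` reproduces the table of probabilities for the sum.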


19. Conditional Probabilities 

In setting up the joint probability function in the preceding example,
we took a quick look at the compound events and decided on their
probabilities directly. Now, in many situations this is not feasible; so
we turn our attention to the problem of working up from an analysis of
the component events to a determination of probabilities for the
compound events. As we noted above, a knowledge of the marginal
density functions will not suffice, because these do not determine the
joint density function uniquely; and two stochastic variables are not
adequately described unless the joint density function is given.

So we need a new idea; and this idea turns out to be that of conditional
probability. This notion is described by the following problem:
Given a < x < b, what is the probability that c < y < d? The
answer will be found in the usual manner [from (d) of Theorem I] once
we obtain a new joint density function $\psi(x,y)$ which describes our
hypotheses. Since all events for which x < a or x > b are to be
ruled out, we set $\psi(x,y) = 0$ over these portions of the xy plane.
In the strip a < x < b we want x and y related as before; so we set
$\psi(x,y) = k\varphi(x,y)$ over this strip. Now, $\psi$ is determined as soon as we
find the constant k. Noting that we must have
$$1 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \psi(x,y)\,dx\,dy = \int_{-\infty}^{\infty}\int_a^b k\varphi(x,y)\,dx\,dy = k\int_a^b\int_{-\infty}^{\infty} \varphi(x,y)\,dy\,dx = k\int_a^b f(x)\,dx,$$
it appears that
$$k = \frac{1}{\int_a^b f(x)\,dx}.$$

Now, using (d) of Theorem I, we have that, given a < x < b,
$$\mathrm{pr}\{c < y < d\} = \int_c^d\int_a^b \psi(x,y)\,dx\,dy = \int_c^d \left[\frac{\int_a^b \varphi(x,y)\,dx}{\int_a^b f(x)\,dx}\right] dy.$$

The expression in brackets will be called the conditional density function
for y:
$$g_{a,b}(y) = \frac{\int_a^b \varphi(x,y)\,dx}{\int_a^b f(x)\,dx}.$$

The conditional probabilities we mentioned will then be given by
integrating this conditional density function. Our notation for conditional
probabilities will have the condition described in a subscript.
Thus the probability that c < y < d, given that a < x < b, would be
$$\mathrm{pr}_{a<x<b}\{c < y < d\}.$$
To save ink, we shall frequently abbreviate this to
$$\mathrm{pr}_{a,b}\{c < y < d\}.$$
In this notation,
$$\mathrm{pr}_{a,b}\{c < y < d\} = \int_c^d g_{a,b}(y)\,dy.$$
The other set of conditional probabilities (probabilities for x, given
information about y) are defined in a similar manner. Let us collect
these notions and state them formally as follows:


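Definition 1. Let $\varphi(x,y)$ be the joint density function for a pair of
stochastic variables x and y, and let
$$f(x) = \int_{-\infty}^{\infty} \varphi(x,y)\,dy \quad\text{and}\quad g(y) = \int_{-\infty}^{\infty} \varphi(x,y)\,dx$$
be the corresponding marginal density functions. Then the conditional
density functions are defined as

(a) $f_{c,d}(x) = \dfrac{\int_c^d \varphi(x,y)\,dy}{\int_c^d g(y)\,dy}, \qquad g_{a,b}(y) = \dfrac{\int_a^b \varphi(x,y)\,dx}{\int_a^b f(x)\,dx}.$

Conditional probabilities for x and y are defined by

(b) $\mathrm{pr}_{c,d}\{a < x < b\} = \int_a^b f_{c,d}(x)\,dx, \qquad \mathrm{pr}_{a,b}\{c < y < d\} = \int_c^d g_{a,b}(y)\,dy.$
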


In the discrete case, probabilities of single values for x and y are
different from zero; and the usual form for conditional probabilities
is sufficiently different to be worth mentioning. Note that, with the
usual transposition of integrals into sums, Definition 1 describes the
discrete situation too. The following is merely an important special
case:

Theorem III. Let $\varphi(x,y)$ be the joint probability function for a pair
of stochastic variables x and y, defined over a discrete set, and let
$$f(x) = \sum_y \varphi(x,y) \quad\text{and}\quad g(y) = \sum_x \varphi(x,y)$$
be the corresponding marginal probability functions. Then, conditional
probability functions are given by

(a) $f_{y_0}(x) = \dfrac{\varphi(x,y_0)}{g(y_0)}, \qquad g_{x_0}(y) = \dfrac{\varphi(x_0,y)}{f(x_0)}.$

Conditional probabilities are given by

(b) $\mathrm{pr}_{y=y_0}\{x = x_0\} = f_{y_0}(x_0) = \dfrac{\varphi(x_0,y_0)}{g(y_0)}, \qquad \mathrm{pr}_{x=x_0}\{y = y_0\} = g_{x_0}(y_0) = \dfrac{\varphi(x_0,y_0)}{f(x_0)}.$

For another very useful description of conditional probability in the
finite case, see Theorem V below.
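
As an illustration of the discrete formulas, here is a minimal sketch. The example is an assumption chosen for this note and does not appear in the text: two dice are thrown, x is the result on the first die, and y is the larger of the two results. The function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Joint probability function for (x, y) = (first die, larger of the two dice).
phi = defaultdict(Fraction)
for d1 in range(1, 7):
    for d2 in range(1, 7):
        phi[(d1, max(d1, d2))] += Fraction(1, 36)


def f(x0):
    """Marginal probability function of x: sum of phi over y."""
    return sum(p for (x, y), p in phi.items() if x == x0)


def g(y0):
    """Marginal probability function of y: sum of phi over x."""
    return sum(p for (x, y), p in phi.items() if y == y0)


def cond_x_given_y(x0, y0):
    """pr{x = x0} given y = y0, i.e. phi(x0, y0) / g(y0)."""
    return phi.get((x0, y0), Fraction(0)) / g(y0)


def cond_y_given_x(y0, x0):
    """pr{y = y0} given x = x0, i.e. phi(x0, y0) / f(x0)."""
    return phi.get((x0, y0), Fraction(0)) / f(x0)


p_first_is_3_given_max_3 = cond_x_given_y(3, 3)
p_max_is_3_given_first_3 = cond_y_given_x(3, 3)
```

Given that the larger result is 3, the first die shows 3 with probability 3/5; given that the first die shows 3, the larger result is 3 with probability 1/2. The two conditional probabilities come from the same value of the joint function divided by different marginal values.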


20. The Multiplication Theorem 


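Definition 1 shows clearly that, if the joint density function is given,
the conditional probabilities are determined. The interesting thing is
that we can go the other way. If the marginal density functions and
all the conditional probabilities are known, the joint density function
can be found. To see how this is accomplished, we need to look at the
multiplication theorem.

Theorem IV.
$$\mathrm{pr}\{AB\} = \mathrm{pr}\{A\}\,\mathrm{pr}_A\{B\} = \mathrm{pr}\{B\}\,\mathrm{pr}_B\{A\}.$$

If A is the event a < x < b and B is the event c < y < d, then
$$\mathrm{pr}\{AB\} = \int_c^d\int_a^b \varphi(x,y)\,dx\,dy = \int_a^b f(x)\,dx \int_c^d \frac{\int_a^b \varphi(x,y)\,dx}{\int_a^b f(x)\,dx}\,dy = \int_a^b f(x)\,dx \int_c^d g_{a,b}(y)\,dy = \mathrm{pr}\{A\}\,\mathrm{pr}_A\{B\}.$$
The second equality of the theorem follows in the same way with the
roles of x and y interchanged.
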
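If A is itself a compound event $A_1A_2$, then Theorem IV can be
applied twice to give
$$\mathrm{pr}\{A_1A_2B\} = \mathrm{pr}\{A_1A_2\}\,\mathrm{pr}_{A_1A_2}\{B\} = \mathrm{pr}\{A_1\}\,\mathrm{pr}_{A_1}\{A_2\}\,\mathrm{pr}_{A_1A_2}\{B\}.$$
Continuing in the same manner, we get a general multiplication
principle:

Corollary 1.
$$\mathrm{pr}\{E_1E_2\cdots E_n\} = \mathrm{pr}\{E_1\}\,\mathrm{pr}_{E_1}\{E_2\}\,\mathrm{pr}_{E_1E_2}\{E_3\}\cdots\mathrm{pr}_{E_1E_2\cdots E_{n-1}}\{E_n\}.$$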
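
The general multiplication principle can be checked on a small example. The situation below is an assumption made for illustration and is not taken from the text: three cards are dealt in succession from a 52-card deck, and $E_i$ is the event that the ith card is an ace. The sketch compares the chain of conditional probabilities with a brute-force count over all ordered deals.

```python
from fractions import Fraction
from itertools import permutations

# Chain of conditional probabilities: 4 aces among 52 cards, then 3 among
# the remaining 51, then 2 among the remaining 50.
chain = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)

# Brute force: label the cards 0..51 and let 0..3 be the aces; count the
# ordered 3-card deals consisting entirely of aces.
hits = sum(1 for deal in permutations(range(52), 3) if all(c < 4 for c in deal))
brute = Fraction(hits, 52 * 51 * 50)
```

Both computations give 1/5,525, since the 24 favorable deals are counted among 132,600 equally likely ordered deals.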


The multiplication theorem tells us that conditional probabilities
are described by the equation

(1) $\mathrm{pr}_A\{B\} = \dfrac{\mathrm{pr}\{AB\}}{\mathrm{pr}\{A\}}.$

Suppose we have a physical situation in which there are n equally
likely results; suppose the event A is a set of r of these and r' of these
r results are also contained in the event B. Then the event AB is just
these r' results; so by Definition 1, Chap. 2,
$$\mathrm{pr}\{A\} = \frac{r}{n}, \qquad \mathrm{pr}\{AB\} = \frac{r'}{n},$$
and Equation (1) tells us that
$$\mathrm{pr}_A\{B\} = \frac{r'/n}{r/n} = \frac{r'}{r}.$$
To give a formal statement:


Theorem V. Suppose the events A and B are sets of equally likely
results of an experiment. If A is a set of r such results and if r' of
these r results are also contained in B, then $\mathrm{pr}_A\{B\} = r'/r$.
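
Theorem V amounts to counting, as the following sketch suggests. The events are assumptions chosen for illustration: two dice are thrown, A is the event that the sum is 10 or more, and B is the event that the two dice show the same number. The names `in_A`, `in_B`, `r_count`, and `r_prime` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# The n = 36 equally likely results of throwing 2 dice.
results = list(product(range(1, 7), repeat=2))


def in_A(result):
    """Event A: the sum of the two dice is 10 or more."""
    return result[0] + result[1] >= 10


def in_B(result):
    """Event B: both dice show the same number."""
    return result[0] == result[1]


n = len(results)
r_count = sum(1 for res in results if in_A(res))               # r
r_prime = sum(1 for res in results if in_A(res) and in_B(res))  # r'

by_theorem_v = Fraction(r_prime, r_count)                       # r'/r
by_equation_1 = Fraction(r_prime, n) / Fraction(r_count, n)     # pr{AB}/pr{A}
```

With r = 6 and r' = 2, both routes give 1/3.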


21. Bayes’ Theorem 

Conditional probabilities in general are described by Equation (1),
Sec. 20; and problems involving conditional probabilities can frequently
be solved by the direct application of this equation. However,
there is an interesting special case, known as Bayes' theorem,
that is worth taking a look at.

Suppose we have a situation in which an event A can occur only in
conjunction with one of the mutually exclusive events $B_1, B_2, \ldots, B_n$.
Symbolically, we write this
$$A = \sum_{i=1}^{n} AB_i.$$
For a physical picture, think of the events $B_1, B_2, \ldots, B_n$ as the set
of all possible causes and of A as the result. Suppose, further, that
the conditional probabilities $\mathrm{pr}_{B_i}\{A\}$ are known. We should like to
find the reverse conditional probabilities $\mathrm{pr}_A\{B_i\}$.


If we think of the events $B_i$ as hypotheses or causes and of A as a
final result, $\mathrm{pr}_A\{B_i\}$ is interpreted as follows: If we know that the
result is A, what is the probability that the cause of it was $B_i$? For
this reason the probabilities given by Bayes' theorem are called a
posteriori probabilities, the idea being that the probability of a possible
cause is computed after the result has transpired. The student
should note that this name is given to the conditional probability
$\mathrm{pr}_A\{B_i\}$ only because of the applications usually made of this theorem.
Fundamentally, it is just another conditional probability; and as the
student can see by looking at (b) of Definition 1, there is no essential
distinction between the two ideas of conditional probability.

Noting that the compound events $AB_i$ and $B_iA$ are the same thing,
and applying Theorem IV, we have
$$\mathrm{pr}\{A\}\,\mathrm{pr}_A\{B_i\} = \mathrm{pr}\{AB_i\} = \mathrm{pr}\{B_iA\} = \mathrm{pr}\{B_i\}\,\mathrm{pr}_{B_i}\{A\}.$$
Solving this for $\mathrm{pr}_A\{B_i\}$, we have
$$\mathrm{pr}_A\{B_i\} = \frac{\mathrm{pr}\{B_i\}\,\mathrm{pr}_{B_i}\{A\}}{\mathrm{pr}\{A\}}.$$
From the assumptions that $A = \sum_i AB_i$ and that the events $B_i$ are
mutually exclusive, we have, by Axiom C and Theorem IV, that
$$\mathrm{pr}\{A\} = \sum_{j=1}^{n} \mathrm{pr}\{B_j\}\,\mathrm{pr}_{B_j}\{A\}.$$
Substituting, we arrive at the formula usually known as Bayes'
theorem:

(1) $\mathrm{pr}_A\{B_i\} = \dfrac{\mathrm{pr}\{B_i\}\,\mathrm{pr}_{B_i}\{A\}}{\sum_{j=1}^{n} \mathrm{pr}\{B_j\}\,\mathrm{pr}_{B_j}\{A\}}.$

Before getting wild ideas about what Bayes' theorem can do for us,
we should look carefully at the right-hand side of (1). The important
thing we see there is that, in order to compute $\mathrm{pr}_A\{B_i\}$, we must
know all the a priori probabilities $\mathrm{pr}\{B_j\}$. For instance, we might
ask the question: "Knowing that a sample came from one of a given
set of populations, can we tell by looking at the sample the probability
that it came from a certain one?" Let us note that before we can use
Bayes' theorem to answer this, we must first answer the question:
"Can we tell without looking at the sample the probability that it came
from a certain population?" In many cases intelligent assumptions
as to the a priori probabilities can be made, but we should note that we
do not get something for nothing here. We still have to start with an
assumption, and all our results are based on that assumption.
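
The dependence of the result on the a priori probabilities shows up clearly in a small computation. The sketch below borrows the two urns of Prob. 15, Chap. 2 (3 white balls among 20 in the first, 8 white among 20 in the second); the selection of an urn at random, the two sets of a priori probabilities, and the function name `bayes` are assumptions introduced only for illustration.

```python
from fractions import Fraction


def bayes(priors, likelihoods):
    """A posteriori probabilities pr_A{B_i} from pr{B_i} and pr_{B_i}{A}."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # pr{B_i} pr_{B_i}{A}
    total = sum(joint)                                    # pr{A}
    return [j / total for j in joint]


# pr_{B_i}{A}: probability of drawing a white ball from urn 1 and from urn 2.
white_given_urn = [Fraction(3, 20), Fraction(8, 20)]

# Assumption 1: each urn is equally likely to have been chosen.
equal_priors = bayes([Fraction(1, 2), Fraction(1, 2)], white_given_urn)

# Assumption 2: the second urn is three times as likely to have been chosen.
unequal_priors = bayes([Fraction(1, 4), Fraction(3, 4)], white_given_urn)
```

Under the first assumption the a posteriori probability of the first urn is 3/11; under the second it drops to 1/9. The same observed white ball leads to different conclusions according to the assumption we start with.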


22. Construction of Joint Density Functions 


The notion of conditional probability furnishes us with the necessary 
tools for the construction of joint density functions to represent given 
physical situations. 


Theorem VI (Working Rule). If a situation involves events each of
which is to be regarded as the product of two events, then to get a
stochastic variable representation:

(a) Classify the component events into two classes X and Y such
that every compound event involved is the product of one event of
class X and one of class Y.

(b) Represent the events of class X by a single stochastic variable
x, and find the density function (or probability function). Use
Theorem II or V, Chap. 2, whichever applies.

(c) Associate the events of class Y in some natural manner with
values of a variable y.

(d) Discrete Case. For each value $x_0$ of x, get $\mathrm{pr}\{x = x_0\}$ from the
probability function in (b). Then, for each value $y_0$ of y, figure out
from the physical situation $\mathrm{pr}_{x=x_0}\{y = y_0\}$. Put these results into
Theorem IV, and get $\mathrm{pr}\{x = x_0 \text{ and } y = y_0\}$. Use this as the value of
$\varphi(x_0,y_0)$.

(d') Continuous Case. For each value $x_0$ of x, set up the usual
approximation $\mathrm{pr}\{x_0 < x < x_0 + dx\} = f(x_0)\,dx$ from the density function
in (b). Then, for each value $y_0$ of y, figure out from the physical
situation the usual type of approximation to the conditional probability
$\mathrm{pr}_{x_0<x<x_0+dx}\{y_0 < y < y_0 + dy\}$. Use these results and Theorem IV to
get an approximation to $\mathrm{pr}\{x_0 < x < x_0 + dx \text{ and } y_0 < y < y_0 + dy\}$.
Express this final result in the form $\varphi(x_0,y_0)\,dx\,dy$, and this gives the
form of the joint density function $\varphi$.


We omit all proof that this rule does what it is advertised as doing. 
In the discrete case this follows at once from (d) of Theorem II. In 
the continuous case we meet with the same difficulties that caused us 
to skip the proof of Theorem V, Chap. 2. 

In step (a) the question of which set of component events to call X
and which to call Y is decided by looking ahead to (d) or (d') and seeing
which set of conditional probabilities can be computed readily. After
a little experience the student will find that in some cases it makes no
difference, while in others the problem is easy one way but practically
impossible the other.

For many situations Theorem VI is unnecessarily complicated, and 
we do not want to be quoted as recommending an unduly laborious 
presentation for every problem the student works. We have merely 
tried to give full and explicit instructions for the benefit of anyone who 
is having trouble. Here is a short summary of this working rule that 
may prove more useful than the formal rule itself. If a set of events 
can be arranged naturally in a two-dimensional array, arrange them 
that way and assign abscissas and ordinates. The joint probability 
function will be given in the discrete case by the probabilities of the 
various points in the array. The form of the joint density function 
will be indicated in the continuous case by the probability of a repre- 
sentative square dx by dy. Compute these probabilities by any means 
that come to hand. Theorem IV will frequently be of use in this 
connection. 


23. Independent Stochastic Variables 


Before turning to specific examples, we should introduce the notion 
of independent stochastic variables. A common-sense definition of this 
notion would be that x and y are independent if probabilities for y do 
not depend on values of x. Stated in terms of the notation we are
using in this chapter, this would mean that $g_{a,b}(y)$ does not depend on
a or b. If this is the case, we get the same result for every a and b; so
setting $a = -\infty$, $b = \infty$, we have

$$g_{a,b}(y) = \frac{\int_a^b \varphi(x,y)\,dx}{\int_a^b f(x)\,dx} = \frac{\int_{-\infty}^{\infty} \varphi(x,y)\,dx}{\int_{-\infty}^{\infty} f(x)\,dx} = g(y)$$

for every a, b. Thus for independent variables

$$\int_a^b \varphi(x,y)\,dx = g(y) \int_a^b f(x)\,dx;$$

and, integrating with respect to y, we have

$$\int_c^d \int_a^b \varphi(x,y)\,dx\,dy = \int_c^d g(y)\,dy \int_a^b f(x)\,dx,$$

which means Theorem IV will read pr{AB} = pr{A}pr{B}. Furthermore,
since the above must hold for every a, b, c, d, it follows that
$\varphi(x,y) = f(x)g(y)$. This is another of those things which we have no
intention of trying to prove in an elementary course, but we can generate
a suspicion that it would be the case by letting $a = x_0$, $b = x_0 + dx$,
$c = y_0$, and $d = y_0 + dy$:

$$\int_{y_0}^{y_0+dy} \int_{x_0}^{x_0+dx} \varphi(x,y)\,dx\,dy = \int_{x_0}^{x_0+dx} f(x)\,dx \int_{y_0}^{y_0+dy} g(y)\,dy.$$

Now, we make our usual approximation and have
$$\varphi(x_0, y_0)\,dy\,dx = f(x_0)\,dx\; g(y_0)\,dy;$$
and, dividing by dy dx, we get $\varphi(x_0, y_0) = f(x_0)g(y_0)$. It is the use of
the approximation that keeps this argument from being rigorous. For
the benefit of those who are interested, we might add that a genuine
proof would consist in applying the first mean value theorem for integrals
(Franklin, Treatise on Advanced Calculus, page 201) to get the
result in case the functions are continuous and then noting that
integrability will require continuity at so many points that we can change
things to make the result hold everywhere without altering the value
of any integrals.
Finally, let us note that it is trivial to go back from this last result
to our starting point; i.e., if $\varphi(x,y) = f(x)g(y)$, then $g_{a,b}(y)$ is independent
of a and b; for, in this case,

$$g_{a,b}(y) = \frac{\int_a^b f(x)g(y)\,dx}{\int_a^b f(x)\,dx} = g(y).$$

A similar circle of implications could be given, starting with the
proposition that $f_{c,d}(x)$ is independent of c and d. Thus we have four
equivalent statements which we use to define independent stochastic variables.


Definition 2. The stochastic variables x and y are said to be
independent if the following four equivalent conditions hold:

(a) $g_{a,b}(y) = g(y)$ for every a, b.
(b) $f_{c,d}(x) = f(x)$ for every c, d.
(c) $\mathrm{pr}\{a \le x \le b \text{ and } c \le y \le d\} = \mathrm{pr}\{a \le x \le b\}\,\mathrm{pr}\{c \le y \le d\}$ for every a, b, c, d.
(d) $\varphi(x,y) = f(x)g(y)$ for every x, y.
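
In the discrete case, condition (d) is the easiest of the four to check mechanically. The following is a minimal Python sketch, not part of the original text: the function names and the two small tables are hypothetical, chosen only to show one table that satisfies (d) and one that does not.

```python
from fractions import Fraction

def marginals(phi):
    """Return (f, g), the marginal probability functions of a joint table phi[(x, y)]."""
    f, g = {}, {}
    for (x, y), p in phi.items():
        f[x] = f.get(x, 0) + p
        g[y] = g.get(y, 0) + p
    return f, g

def is_independent(phi):
    """Condition (d): phi(x, y) == f(x) * g(y) at every point of the array."""
    f, g = marginals(phi)
    return all(phi.get((x, y), 0) == f[x] * g[y] for x in f for y in g)

# Hypothetical tables, chosen only for illustration.
product_table = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}
diagonal_table = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

print(is_independent(product_table))   # every entry equals f(x) * g(y)
print(is_independent(diagonal_table))  # missing corners have probability 0, not 1/4
```

Exact fractions are used so that the equality test in (d) is not disturbed by rounding.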

For practical purposes, note that this equivalence means that to
prove x and y are independent we need prove only that some one of
these conditions holds, while if x and y are known to be independent,
it follows that all four conditions hold.

The phrase "independent events" is frequently used in probability
theory. We shall say that two events are independent if they can be
described, one by a set of x values and the other by a set of y values,
where x and y are independent stochastic variables. Conditions (a),
(b), and (c) of Definition 2 then translate immediately:


Theorem VII. For independent events A and B, 


(a) $\mathrm{pr}_A\{B\} = \mathrm{pr}\{B\}$.
(b) $\mathrm{pr}_B\{A\} = \mathrm{pr}\{A\}$.
(c) pr{AB} = pr{A}pr{B}. 


Furthermore, any one of these conditions will guarantee that A and B 
are independent. 


Generalizing condition (d) of Definition 2 for independent stochastic 
variables, we say that n stochastic variables are independent if their 
joint density function is the product of n functions each depending on 
only 1 of the variables. If n variables are independent, any 2 of them 
are; but the converse is not true. Consider 3 variables x, y, and z, each
of which assumes the values 0 and 1. The joint probability function
$\Phi(x,y,z)$ must then be defined at each corner of the unit cube; let it be
defined by the numbers at the corners of the cube in this diagram:

[Figure: unit cube with a value of $\Phi$ marked at each corner.]

Now, to get $\varphi(x,y)$, we add along each line parallel to the z axis and
discover that $\varphi(x,y)$ is 1/4 at each of its 4 points. Similarly, adding
with respect to y and x, respectively, we find that $\theta(x,z)$ and $\psi(y,z)$ are
each identically 1/4. From these squares with 1/4 at each corner, it is
easy to see that another round of adding gives f(x) = 1/2 at each of
its points and the same for g(y) and h(z). When we try to build these up
by multiplication, we see that $f(x)g(y) = \tfrac12 \cdot \tfrac12 = \tfrac14 = \varphi(x,y)$
for each of the 4 pairs of values of x and y; $\theta(x,z) = f(x)h(z)$
at all 4 points in the xz plane, and $\psi(y,z) = g(y)h(z)$ at all 4 points.
However, f(x)g(y)h(z) is identically equal to 1/8, and $\Phi(x,y,z)$ is not
equal to 1/8 anywhere. Thus, x and y are independent; x and z are
independent; y and z are independent; but x, y, and z are not.
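
The argument above can be checked by direct addition. The sketch below is an illustration only: the book's cube diagram is not available here, so the corner values (1/4 where the coordinate sum is even, 0 elsewhere) are an assumption, chosen because they reproduce every number quoted in the text. The helper name `marginal` is likewise hypothetical.

```python
from fractions import Fraction
from itertools import product

# ASSUMPTION: one corner assignment consistent with the text, not the book's diagram.
Phi = {c: (Fraction(1, 4) if sum(c) % 2 == 0 else Fraction(0))
       for c in product((0, 1), repeat=3)}

def marginal(joint, keep):
    """Add the joint table over every axis whose index is not in `keep`."""
    out = {}
    for point, p in joint.items():
        key = tuple(point[i] for i in keep)
        out[key] = out.get(key, Fraction(0)) + p
    return out

# Two-variable tables (phi, theta, psi in the text) and one-variable tables (f, g, h).
pairs = {axes: marginal(Phi, axes) for axes in ((0, 1), (0, 2), (1, 2))}
singles = [marginal(Phi, (i,)) for i in range(3)]

# Each pair satisfies condition (d) of Definition 2 ...
pairwise_ok = all(
    table[(u, v)] == singles[a][(u,)] * singles[b][(v,)]
    for (a, b), table in pairs.items() for u in (0, 1) for v in (0, 1)
)
# ... but the three-variable product f(x) g(y) h(z) = 1/8 never equals Phi.
triple_ok = all(
    Phi[(x, y, z)] == singles[0][(x,)] * singles[1][(y,)] * singles[2][(z,)]
    for (x, y, z) in Phi
)
print(pairwise_ok, triple_ok)
```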




24. Joint Density Functions—Independent Case 

If, in setting up a two-dimensional stochastic variable representation
of a situation, we find that the events we have represented by y values
are independent of those represented by x values (and the physical
situation will usually show very clearly that this is or is not the case),
then we shall want to make x and y independent stochastic variables;
and the problem of setting up the joint density function is considerably
simplified. Technically, Theorem VI still applies; but practically it is
far too complicated. In its place we suggest the following:


Theorem VIII (Special Working Rule). Follow through steps (a), (b),
and (c) of Theorem VI; then, if x and y appear to be independent, set
up g(y) by means of Theorem III or V, Chap. 2 (whichever applies),
multiply by the f(x) obtained in (b) of Theorem VI, and use this
product for $\varphi(x,y)$.


It is worth noting that, if x and y are independent and each has a
constant density function, then $\varphi(x,y)$ will be constant also. The
continuous case is particularly worth studying under these circumstances.
If $\varphi$ is constant, then $\varphi(x,y) = 1/A$, where A is the total area covered
by all possible results. With $\varphi(x,y)$ identically 1/A, the integral of
$\varphi$ over any area A' inside this just gives A'/A. These observations
suggest the following:


Theorem IX (Very Special Working Rule). In the continuous case,
for two independent stochastic variables, each with a constant density
function, the probability of an event described by pairs of values of x
and y is given by the area representing the favorable cases divided by
the area representing all possible cases.


Many problems fall under this rule, and for these problems the solution
is certainly very simple. Care must be taken, however, not to use
Theorem IX where it does not apply; and the most likely source of
trouble is that x and y can have constant density functions and still not
come under Theorem IX. Independence must be checked in addition
to this. Consider, for example, a joint density function

$$\varphi(x,y) = \begin{cases} \dfrac{1}{4\pi^2}\,(1 - \sin x \sin y) & \text{for } 0 \le x \le 2\pi \text{ and } 0 \le y \le 2\pi, \\[4pt] 0 & \text{otherwise.} \end{cases}$$




Obviously, probabilities are not proportional to areas in this case; yet 
the marginal density functions are both constant. 
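
A quick numerical check makes the warning concrete. The sketch below is an illustration, not part of the original text. It assumes the density $(1/4\pi^2)(1 - \sin x \sin y)$ on the square $0 \le x, y \le 2\pi$ and uses a simple midpoint rule; the function names and grid sizes are arbitrary choices.

```python
import math

def phi(x, y):
    """Joint density (1/4 pi^2)(1 - sin x sin y) on [0, 2 pi] x [0, 2 pi], else 0."""
    if 0 <= x <= 2 * math.pi and 0 <= y <= 2 * math.pi:
        return (1 - math.sin(x) * math.sin(y)) / (4 * math.pi ** 2)
    return 0.0

def midpoint_1d(func, lo, hi, n=400):
    """Midpoint-rule approximation to the integral of func over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(func(lo + (k + 0.5) * h) for k in range(n))

def marginal_f(x):
    """f(x): integrate phi over all y."""
    return midpoint_1d(lambda y: phi(x, y), 0.0, 2 * math.pi)

def prob_rect(a, b, c, d, n=200):
    """pr{a < x < b and c < y < d} by iterated midpoint integration."""
    return midpoint_1d(lambda x: midpoint_1d(lambda y: phi(x, y), c, d, n), a, b, n)

f_values = [marginal_f(x) for x in (0.3, 1.0, 2.5, 4.0)]   # all close to 1/(2 pi)
p_quarter = prob_rect(0.0, math.pi, 0.0, math.pi)          # exact value: 1/4 - 1/pi^2
area_ratio = 0.25                                          # what Theorem IX would give
print(f_values, p_quarter, area_ratio)
```

The quarter square has one fourth of the total area, yet its probability is $1/4 - 1/\pi^2$, so the area rule of Theorem IX would give the wrong answer here even though the marginal density is constant.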


25. Example—Three Urns 


Suppose we have three urns containing balls as follows: 


Urn I: 3 white, 2 black 
Urn II: 4 white, 7 black 
Urn III: 5 white, 4 black 


Our procedure will be to choose an urn at random and then draw a
ball from it. Two specific questions that might be raised are: (a) What
is the probability that the ball drawn is white? (b) Given that the ball
drawn is white, what is the probability that the urn chosen was urn I?
Finally, we might set up a complete representation of this situation by
two stochastic variables.

To answer question (a), we note that a white ball can be obtained in 
any one of three mutually exclusive ways: 


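Urn I, white ball
Urn II, white ball
Urn III, white ball

Each of these is a compound event, but their probabilities can be
computed from Theorem IV because the conditional probability for
color, given the urn number, is obvious. So we have

$$\mathrm{pr}\{\mathrm{I}, w\} = \mathrm{pr}\{\mathrm{I}\}\,\mathrm{pr}_{\mathrm{I}}\{w\} = \tfrac13 \times \tfrac35 = \tfrac15,$$
$$\mathrm{pr}\{\mathrm{II}, w\} = \mathrm{pr}\{\mathrm{II}\}\,\mathrm{pr}_{\mathrm{II}}\{w\} = \tfrac13 \times \tfrac{4}{11} = \tfrac{4}{33},$$
$$\mathrm{pr}\{\mathrm{III}, w\} = \mathrm{pr}\{\mathrm{III}\}\,\mathrm{pr}_{\mathrm{III}}\{w\} = \tfrac13 \times \tfrac59 = \tfrac{5}{27}.$$

Applying Axiom C, we get

$$\mathrm{pr}\{w\} = \frac15 + \frac{4}{33} + \frac{5}{27} = \frac{752}{1485}.$$

Question (b) calls for a direct application of Bayes' theorem,
Equation (1), Sec. 21:

$$\mathrm{pr}_w\{\mathrm{I}\} = \frac{\mathrm{pr}\{\mathrm{I}\}\,\mathrm{pr}_{\mathrm{I}}\{w\}}{\mathrm{pr}\{\mathrm{I}\}\,\mathrm{pr}_{\mathrm{I}}\{w\} + \mathrm{pr}\{\mathrm{II}\}\,\mathrm{pr}_{\mathrm{II}}\{w\} + \mathrm{pr}\{\mathrm{III}\}\,\mathrm{pr}_{\mathrm{III}}\{w\}} = \frac{\tfrac13 \times \tfrac35}{\tfrac13 \times \tfrac35 + \tfrac13 \times \tfrac{4}{11} + \tfrac13 \times \tfrac59} = \frac{297}{752}.$$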


To get a stochastic variable representation of this situation, we note
that each event under consideration is characterized by an urn and a
color. So we let x represent the urn chosen and y the color drawn.
We found some of the values of $\varphi(x,y)$ in answering question (a). The
others are found in a similar manner, and the complete picture (including
marginal probability functions) looks like this:

      g(y)       y
    733/1485     b      2/15     7/33     4/27
    752/1485     w      1/5      4/33     5/27

                        1/3      1/3      1/3      f(x)
                         I        II      III       x
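
The whole table can be rebuilt with exact fractions. The following Python sketch is an illustration rather than part of the original text; the dictionary layout and variable names are assumptions, while the urn contents and the three rules used (Theorem IV, Axiom C, Bayes' theorem) are those of the example.

```python
from fractions import Fraction

urns = {"I": (3, 2), "II": (4, 7), "III": (5, 4)}   # (white, black) counts from the text
f = {u: Fraction(1, 3) for u in urns}               # marginal f(x): urn chosen at random

# Theorem IV: pr{urn and color} = pr{urn} * pr_urn{color}
phi = {}
for u, (white, black) in urns.items():
    total = white + black
    phi[(u, "w")] = f[u] * Fraction(white, total)
    phi[(u, "b")] = f[u] * Fraction(black, total)

# Axiom C: add the mutually exclusive ways of getting each color to obtain g(y)
g = {color: sum(phi[(u, color)] for u in urns) for color in ("w", "b")}

# Bayes' theorem: pr_w{I} = pr{I, w} / pr{w}
posterior_I = phi[("I", "w")] / g["w"]

print(g["w"], g["b"], posterior_I)
```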


26. Example—Points Chosen at Random 

Suppose two points are chosen independently and at random between
0 and 1. For a stochastic variable representation, we let x be the
position of one of the points and y the position of the other. Then each
point in the unit square of the xy plane represents a way of choosing
the two points between 0 and 1. With x and y chosen independently
and at random, we see that Theorem IX applies. Even that rule is
simplified here because the total area is unity. So, to find the
probability of an event, we must find its representation in the xy plane and
get the area of that representation.

For instance, suppose we want the probability that the two points
chosen between 0 and 1 are within $\varepsilon$ of each other. Analytically, this
would be written $|x - y| \le \varepsilon$, and geometrically it is represented by
the diagonal strip:

[Fig. 6: the diagonal strip $|x - y| \le \varepsilon$ in the unit square.]

The strip consists (as shown) of a rectangle $\sqrt2 - (2\varepsilon/\sqrt2)$ by $\varepsilon\sqrt2$
and two triangles, each with base $\varepsilon$ and altitude $\varepsilon$; therefore

$$\mathrm{pr}\{|x - y| \le \varepsilon\} = \varepsilon\sqrt2\left(\sqrt2 - \frac{2\varepsilon}{\sqrt2}\right) + 2\cdot\frac{\varepsilon^2}{2} = 2\varepsilon - \varepsilon^2.$$


Let us modify the problem so as to make the variables dependent.
Suppose the first point is chosen at random between 0 and 1 and then
the second point is chosen at random to the left of it. We let x and
y have the same significance as before, note that f(x) is unity from 0 to
1 and zero elsewhere, and proceed to (d') of Theorem VI. If $x = x_0$,
then y is chosen at random on a segment of length $x_0$; so

$$\mathrm{pr}_{x = x_0}\{y_0 < y < y_0 + dy\} = \begin{cases} \dfrac{dy}{x_0} & \text{for } 0 < y_0 < x_0, \\[4pt] 0 & \text{otherwise.} \end{cases}$$

Now, since f(x) = 1 for 0 < x < 1,

$$\mathrm{pr}\{x_0 < x < x_0 + dx\} = \begin{cases} dx & \text{for } 0 < x_0 < 1, \\ 0 & \text{otherwise.} \end{cases}$$

So, using Theorem IV, we have

$$\mathrm{pr}\{x_0 < x < x_0 + dx \text{ and } y_0 < y < y_0 + dy\} = \begin{cases} \dfrac{dx\,dy}{x_0} & \text{for } 0 < y_0 < x_0,\ 0 < x_0 < 1, \\[4pt] 0 & \text{otherwise.} \end{cases}$$

Therefore, $\varphi(x,y) = 1/x$ inside the triangle 0 < y < x, 0 < x < 1, and
$\varphi$ is zero elsewhere.

We should note in passing that, even though 1/x is the product of a
function of x alone (1/x) and a function of y alone (identically 1),
$\varphi(x,y)$ is no such thing. This is because the formula $\varphi = 1/x$ applies
only in a triangle. Thus (d) of Definition 2 does not apply. These
variables are definitely dependent. This appears very clearly if we
compute the marginal density functions:

[Fig. 7]

$$f(x) = \int_{-\infty}^{\infty} \varphi(x,y)\,dy = \int_0^x \frac{1}{x}\,dy = 1 \quad (0 < x < 1),$$

$$g(y) = \int_{-\infty}^{\infty} \varphi(x,y)\,dx = \int_y^1 \frac{1}{x}\,dx = \log\frac{1}{y} \quad (0 < y < 1).$$

Obviously, f(x)g(y) is different from $\varphi(x,y)$.

For this dependent case the probability that the points between 0
and 1 are within $\varepsilon$ of each other can no longer be found by Theorem IX,
but (c) of Theorem I will do the job very nicely. Since $\varphi(x,y) = 0$ for
y > x, we have only half the strip we had before. For purposes of
setting up limits of integration, we divide our strip into two sections.

[Fig. 8: the half strip divided into sections I and II.]

Now, by (c) of Theorem I,

$$\mathrm{pr}\{|x - y| \le \varepsilon\} = \iint_{\mathrm{I}} \varphi(x,y)\,dy\,dx + \iint_{\mathrm{II}} \varphi(x,y)\,dy\,dx = \int_0^{\varepsilon}\!\int_0^{x} \frac{1}{x}\,dy\,dx + \int_{\varepsilon}^{1}\!\int_{x-\varepsilon}^{x} \frac{1}{x}\,dy\,dx = \varepsilon\left(1 + \log\frac{1}{\varepsilon}\right).$$
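
Both results of this section, $2\varepsilon - \varepsilon^2$ for independent points and $\varepsilon(1 + \log(1/\varepsilon))$ when the second point is chosen to the left of the first, can be compared with a simulation. The sketch below is an illustration only, not part of the original text; the function name, the seed, the number of trials, and the choice $\varepsilon = 0.2$ are all arbitrary assumptions.

```python
import math
import random

def estimate(eps, dependent, trials=200_000, seed=0):
    """Relative frequency of |x - y| <= eps over simulated pairs of points."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.random()
        # Dependent case: y is chosen at random on the segment to the left of x.
        y = rng.uniform(0.0, x) if dependent else rng.random()
        if abs(x - y) <= eps:
            hits += 1
    return hits / trials

eps = 0.2
independent_exact = 2 * eps - eps ** 2
dependent_exact = eps * (1 + math.log(1 / eps))
independent_sim = estimate(eps, dependent=False)
dependent_sim = estimate(eps, dependent=True)

print(independent_exact, independent_sim)
print(dependent_exact, dependent_sim)
```

With this many trials the simulated frequencies should agree with the formulas to about two decimal places; closer agreement needs more trials.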
27. Example—The Buffon Needle Problem 
Another illustration that fits Theorem IX is the Buffon needle problem.
A board is ruled with equidistant parallel lines, the distance
between consecutive lines being d. A needle of length a < d is thrown
on the board. What is the probability that the needle will intersect
one of the lines? Let us consider the relation of the needle to the line
it touches or the nearest line below it. We shall characterize its
position by means of the variables y and $\theta$.

[Fig. 9: the needle, with y and $\theta$ marked.]

We now assume that y and $\theta$ have constant density functions and are
independent. The possible cases are described by the conditions
$0 \le \theta \le \pi$, $0 \le y \le d$; the favorable cases by $0 \le y \le a \sin\theta$. By
Theorem IX, the required probability is equal to the shaded area divided
by the total area:

$$p = \frac{\int_0^{\pi} a \sin\theta\,d\theta}{\pi d} = \frac{2a}{\pi d}.$$

[Fig. 10: the region under the curve $y = a \sin\theta$ for $0 \le \theta \le \pi$.]

This problem was first mentioned in print by Buffon in 1733 and
attracted attention for years after that. Its chief fascination lay in
the fact that the number $\pi$ appears in the answer. The idea was that
from the result $p = 2a/\pi d$ one could get the "formula" $\pi = 2a/pd$;
then an experimental evaluation of p would lead to a simple
"computation of $\pi$." Uspensky (Introduction to Mathematical Probability, pages
112 to 113) cites the results of several such experiments in which the
answers are quite reasonable. However, these can hardly be regarded
as verification that $\pi = 3.14159 \ldots$. Instead, they serve as evidence
that we were correct in assuming that y and $\theta$ are independent
variables, each with a constant density function.
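
The kind of experiment Uspensky reports can be imitated on a computer. The following Python sketch is an illustration only, not part of the original text. It builds in the same assumption as the text, that y and $\theta$ are independent with constant densities on $0 \le y \le d$ and $0 \le \theta \le \pi$; the function name, seed, number of trials, and the values a = 1, d = 2 are arbitrary choices.

```python
import math
import random

def buffon_estimate(a, d, trials=200_000, seed=1):
    """Relative frequency of crossings, with y uniform on [0, d] and theta on [0, pi]."""
    if not 0 < a < d:
        raise ValueError("the text assumes needle length a < line spacing d")
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = rng.uniform(0.0, d)
        theta = rng.uniform(0.0, math.pi)
        if y <= a * math.sin(theta):   # favorable region under y = a sin(theta)
            hits += 1
    return hits / trials

a, d = 1.0, 2.0
p_exact = 2 * a / (math.pi * d)
p_sim = buffon_estimate(a, d)
pi_from_experiment = 2 * a / (p_sim * d)   # the "formula" pi = 2a / (p d)

print(p_exact, p_sim, pi_from_experiment)
```

As the text points out, agreement here says more about the assumed densities than about $\pi$, since the simulation already uses `math.pi` to draw $\theta$.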


28. Example—Genetics 


[...] for most of our later discussion. It should be [...] the addition
principle (Axiom C) and the multiplication theorem (Theorem IV) may be
used quite effectively [...] concerning probabilities, with no [...] at all.
[...] simple problem from the field of genetics. Each [...] (eye color in
human beings, etc.) [...] determined by genes. [...] organism has two
genes, each [...] genes are usually designated by [...] types are AA, Aa,
and aa. In the process of reproduction, the offspring inherits one gene
from each parent. If a parent is of type AA or aa, the inherited gene
must be A or a, respectively. An Aa parent may transmit either kind of
gene with probability 1/2.

Suppose, now, that we have a population (called the zero generation)
with the genes so distributed that if a member is chosen at random, the
probabilities of the various gene types are:

    AA    Aa    aa
    p     2q    r

If parents are chosen independently and at random from this population,
what are the probabilities for the various possible gene types in
the next generation?

A result in the first generation is the product of four independent
events: the gene type of each parent and the choice of a gene from each
of them. We shall designate such a result by listing the type of one
parent, the gene transmitted, the type of the other parent, and the gene
transmitted. For instance, AA, A; Aa, a will mean that one parent
was of type AA and (necessarily) transmitted an A gene, while the
other was of type Aa and transmitted an a gene. Because of the
assumed independence of these four events, we can use (c) of Theorem
VII to get the probability of such a fourfold event.

The type AA in the first generation can result in any of the following
mutually exclusive ways:


    Result           Intermediate steps     Probability
    AA, A; AA, A     p · 1 · p · 1          p²
    AA, A; Aa, A     p · 1 · 2q · 1/2       pq
    Aa, A; AA, A     2q · 1/2 · p · 1       pq
    Aa, A; Aa, A     2q · 1/2 · 2q · 1/2    q²

Thus, by Axiom C, the probability of AA in the first generation is

$$p^2 + 2pq + q^2 = (p + q)^2.$$


To get Aa in the first generation, we have the following mutually
exclusive possibilities:

    Result           Intermediate steps     Probability
    AA, A; aa, a     p · 1 · r · 1          pr
    AA, A; Aa, a     p · 1 · 2q · 1/2       pq
    Aa, A; aa, a     2q · 1/2 · r · 1       qr
    Aa, A; Aa, a     2q · 1/2 · 2q · 1/2    q²

There is a symmetric set of results yielding aA (which amounts to
the same thing); so the probability of Aa is

$$2(pr + pq + qr + q^2) = 2(p + q)(q + r).$$

By symmetry, the probability of aa is

$$(q + r)^2,$$

so we have the following table for the first generation:

    AA          Aa                  aa
    (p + q)²    2(p + q)(q + r)     (q + r)²

Using this table, we can now study the second generation. However,
there is no point in working the problem over again. Instead, let

$$P = (p + q)^2,$$
$$Q = (p + q)(q + r),$$
$$R = (q + r)^2.$$

Then use the previous results. Now, if we remember that because of
their original significance, p + 2q + r = 1, we find that

$$(P + Q)^2 = (p + q)^2 (p + 2q + r)^2 = (p + q)^2,$$
$$2(P + Q)(Q + R) = 2(p + q)(p + 2q + r)(q + r)(p + 2q + r) = 2(p + q)(q + r),$$
$$(Q + R)^2 = (q + r)^2 (p + 2q + r)^2 = (q + r)^2.$$
That is, the probabilities for the second generation are the same as
those for the first. Thus, no matter what the distribution in the zero
generation, the probability distribution is stable from the first
generation on.

The student should note carefully that these are all a priori
probabilities, computed on the basis of an assumption about the zero
generation only. If the population is very large, we can show (see Chap. [...])
[...] an actual distribution [...] a priori probability distribution.
However, [...] should all be close [...] (probably) be stable at all. The
question of [...] for the actual distributions is a very complicated [...]
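
The stability of the first-generation table is easy to confirm with exact arithmetic. The sketch below is an illustration, not part of the original text: `next_generation` is a hypothetical helper that applies the table $(p+q)^2$, $2(p+q)(q+r)$, $(q+r)^2$ derived above, and the zero-generation values are arbitrary numbers satisfying p + 2q + r = 1.

```python
from fractions import Fraction

def next_generation(p, two_q, r):
    """Map type probabilities (AA, Aa, aa) = (p, 2q, r) to those of the next generation."""
    q = two_q / 2
    return ((p + q) ** 2, 2 * (p + q) * (q + r), (q + r) ** 2)

# Hypothetical zero generation, chosen only for illustration (p + 2q + r = 1).
zero = (Fraction(1, 2), Fraction(1, 5), Fraction(3, 10))
first = next_generation(*zero)
second = next_generation(*first)

print(zero)
print(first)
print(second)   # same as the first generation
```

The zero generation differs from the first, but a second application of the map returns the first-generation values unchanged.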




REFERENCES FOR FURTHER STUDY 


Coolidge, An Introduction to Mathematical Probability, Chap. II.
Levy and Roth, Elements of Probability, Chaps. IV, VI.
Uspensky, Introduction to Mathematical Probability, Chaps. II, XII.


PROBLEMS 
1. What is the probability that each of the 4 players in a bridge
game will be dealt a complete suit of cards? Ans. 4!(13!)⁴/52!.

2. Each of two urns contains 5 white and 7 black balls. One ball
is drawn from each urn. What is the probability that both balls drawn
are white? Ans. 25/144.

3. The urns of Prob. 2 are emptied into a third urn, and 2 balls are
drawn simultaneously from this third urn. What is the probability
that both balls drawn are white? Ans. 15/92.

4. Three urns contain, respectively, 2 white and 3 black balls, 4
white and 2 black balls, 3 white and 1 black balls. One ball is drawn
from each urn. What is the probability that among the balls drawn
there are 2 white and 1 black? Represent this problem by three
stochastic variables, and draw a picture. Ans. 14/30.

5. Two urns contain, respectively, 4 white and 3 black balls, 3
white and 7 black balls. One urn is selected and a ball drawn from it.
What is the probability that this ball is white? Ans. 61/140.

6. Represent Prob. 5 by two stochastic variables; describe the
joint density function, and answer the following question: Given that
the ball was white, what is the probability that the urn chosen was the
first one? Ans. 40/61.

7. Urns A and B contain, respectively, 2 white and 1 black balls, 1
white and 5 black balls. One ball is transferred from A to B, and then
a ball is drawn from B. What is the probability that this ball drawn
from B will be white? Ans. 5/21.

8. Represent Prob. 7 by a pair of stochastic variables. Set up the
joint density function, and answer the following question: If the ball
drawn from B was white, what is the probability that the ball
transferred from A to B was also white? Ans. 4/5.

9. Three urns contain, respectively, 2 white and 3 black balls, 1
white and 5 black balls, 6 white and 2 black balls. An urn is chosen at
random and a ball drawn from it. If the ball drawn is white, what is
the probability that the urn was the third one? Ans. 45/79.

10. Two urns contain, respectively, 1 white and 1 black ball, 2 white
and no black balls. One urn is selected at random. A ball is drawn
and replaced; then another drawing is made from the same urn. Each
draw yields a white ball. What is the probability that the urn selected
was the second one? Ans. 4/5.

11. The same as Prob. 10, but 3 drawings were made, each yielding
a white ball. Ans. 8/9.


12. An urn contains 10 white and 10 black balls. Five balls are
transferred to another urn, and samples are drawn from this second
urn, one at a time, with replacements. If 5 such independent samples
are all white, what is the probability that the second urn contains only
white balls? Hint: See Prob. 21, Chap. 2.
Ans. 787,500/4,425,000.

13. In a sequence of throws with 2 dice, what is the probability of
throwing a 7 for the first time on the nth throw? Ans. $5^{n-1}/6^n$.

14. Show that the probability is 1 that a 7 will be thrown some time
in an indefinite sequence of throws of 2 dice.

15. What is the probability that the first 6 will precede the first 7
in a sequence of throws with 2 dice? Ans. 5/11.

16. The game of craps is played as follows: A man throws 2 dice.
If, on his first throw, he gets either 7 or 11, he wins the game. If, on
his first throw, he gets 2, 3, or 12, he loses. If, on his first throw, he
gets 4, 5, 6, 8, 9, or 10, he continues to throw the dice until he either
duplicates the total obtained on his first throw or throws a 7. If the
total on this final throw duplicates his first total, he wins. If the final
throw is a 7, he loses. What is the probability that the man with the
dice will win? Ans. 244/495.

17. Take Definition 1, Chap. 2, as a definition of probability and
Theorem V of this chapter as a definition of conditional probability,
and prove the multiplication theorem (Theorem IV).

18. An urn contains n balls, numbered 1, 2, . . . , n. Two balls are
drawn one after the other, without replacement. Let x be the number
on the first ball and y the number on the second. Set up the joint
probability function for x and y, and show that the marginal probability
functions are the same.


19. Generalize Prob. 18. If all the balls are removed, one after [...]
the [...] for the number on any given draw is the [...] of the numbers
1, [...]

21. Two points are chosen at random on a line segment. What is the
probability that the 3 segments determined can form the sides of
a triangle? Ans. 1/4.

22. The same as Prob. 21, except that one point is chosen at random;
the other is chosen at random to the right of it. Ans. −1/2 + log 2.

23. Given the joint density function

$$\varphi(x,y) = \begin{cases} 3x^2 y + 3y^2 x & \text{for } 0 < x < 1 \text{ and } 0 < y < 1, \\ 0 & \text{otherwise,} \end{cases}$$

find the conditional probability [...]. Ans. [...].

24. Find the marginal density functions in Prob. 23.
Ans. $\tfrac32 x^2 + x$, $y + \tfrac32 y^2$.

25. Are the stochastic variables in Prob. 23 dependent or
independent? Why?


In each of Probs. 26 to 31, a joint density (or probability) function
$\varphi(x,y)$ is described. It is understood that $\varphi(x,y) = 0$ outside the
region specified. In each case apply (d) of Definition 2 to determine
whether the variables are dependent or independent.

26. $\varphi(x,y) = (1/4\pi^2)[1 - \sin(x + y)]$ $(-\pi \le x \le \pi,\ -\pi \le y \le \pi)$. Ans. dep.
27. $\varphi(x,y) = \tfrac12[\cos(x + y) + \cos(x - y)]$ $(0 \le x \le \pi/2,\ 0 \le y \le \pi/2)$. Ans. ind.
28. $\varphi(x,y) = 4xy$ $(0 \le x \le 1,\ 0 \le y \le 1)$. Ans. ind.
29. $\varphi(x,y) = 8xy$ $(0 \le x \le y,\ 0 \le y \le 1)$. Ans. dep.
"T= 1,2, 3; y = 1, 2, 3; eG): 
Ve 156 P72 
He Ms 56 
Me % He Ans. ind. 
31, 
t= 1, 2, 3;y = 1, 2, 3; olz): 
lio Mo 40 
tío M Mo 
5 Yo Yo Mo Ans. dep.

32. Let $\Phi(x,y,z) = (1/8\pi^3)(1 - \sin x \sin y \sin z)$ in the cube
$0 \le x \le 2\pi$, $0 \le y \le 2\pi$, $0 \le z \le 2\pi$ ($\Phi = 0$ elsewhere) be the joint density
function for the three stochastic variables x, y, and z. Show that x
and y are independent, y and z are independent, x and z are
independent, but x, y, and z are not.

33. A number is chosen at random between 0 and 1. Let $x_n = 1$ if
the nth decimal place is a 7, and let $x_n = 0$ if the nth decimal place
is anything else. Show that the stochastic variables $x_1, x_2, \ldots, x_k$
are totally independent. Use Prob. 34, Chap. 2. Note that to
anticipate and solve Prob. 34, Chap. 2, by using (c) of Definition 2 in
this chapter would be to assume the independence of these variables.

34. A number is chosen at random between 0 and 1. Let $x_n$ be the
nth decimal place. Show that the stochastic variables $x_1, x_2, \ldots, x_k$
are totally independent.

The A gene is dominant, the a recessive. This means that organisms
of types AA and Aa will exhibit the same physical characteristic:
that going with the A gene. Only type aa will exhibit the a
characteristic, though type Aa can transmit it to offspring. In Probs. 35 to 40,
assume that the a priori probabilities for the various gene types are

    AA     Aa     aa
    1/4    1/2    1/4

35. If both parents have the dominant characteristic, what is the
probability that the offspring will have the recessive one? Ans. 1/9.

36. If one parent has the dominant characteristic and one the
recessive, what is the probability that the offspring will have the recessive
one? Ans. 1/3.

37. If the offspring has the recessive characteristic, what is the
probability that both parents do? Ans. 1/4.

38. If the offspring has the dominant characteristic, what is the
probability that both parents do? Ans. 2/3.

39. If all four grandparents have the dominant characteristic, what
is the probability that the second-generation offspring will? Ans. 8/9.

40. If all four grandparents and both parents have the dominant
characteristic, what is the probability that the second-generation
offspring will? Ans. 15/16.


CHAPTER 4 
REPEATED TRIALS AND ALTERNATIVE EVENTS 


We put together in this chapter the discussion of two rather unrelated
formulas. The first amounts to no more than a special example
of the combined use of the addition principle and the multiplication
theorem. However, it has assumed sufficient importance in the study
of probability to deserve special mention. The second formula gives
us something akin to an addition principle for events which are not
mutually exclusive.


29. Bernoulli Formula—Physical Version

For our first problem the physical picture is that of a sequence of
trials. In particular, we want to consider a sequence of independent
but identical experiments. The first extensive study of such
a situation was made by Jakob Bernoulli and published posthumously
(1713). As a result, his name is usually attached to sequences of this
type (Definition 1) and to the formula (Theorem I) that applies to such
situations.


Definition 1. A Bernoullian sequence of trials is defined as a (finite
or infinite) sequence of experiments satisfying the following conditions:
(a) For each experiment, the possible results are classified as either
success or failure.
(b) The probability of success is the same for every experiment.
(c) Each result is independent of all the others.


In succeeding chapters we shall have a great deal to say about
Bernoullian trials. It will facilitate matters if we adopt a standard
notation in connection with them. The probability of success on an
individual trial we shall always denote by p. The probability of failure
(1 − p) we shall denote by q.

The phrase "sequence of trials" gives the impression of a sequence
of repetitions of the same experiment. Obviously, such a sequence of
repetitions is an example of a Bernoullian sequence, but it is not the
only one by any means. For example, the simultaneous performance
of the same experiment on different sets of apparatus can be an equally
good example. We want to think of the "trials" as being somehow
ordered, but this ordering may be purely arbitrary. It need not have
anything to do with time. If we toss a single coin n times or throw a
single die n times, we have an example of a Bernoullian sequence; but
we also get an example by tossing n coins or throwing n dice.

It is not even necessary that we do n things to get an example of a
Bernoullian sequence. If we choose a single number at random
between 0 and 1, we define an infinite sequence of decimal places each
of which either is a 7 (success) or is not (failure). These events are
totally independent and have equal probabilities of success (Prob. 33,
Chap. 3); so with one twist of the wrist we describe an infinite
Bernoullian sequence.

The simplest cases of sampling furnish interesting examples of
Bernoullian sequences. If an urn contains black and white balls,
sample drawings (with replacement) are independent events, each
with only two possible outcomes and with identical probabilities.
Therefore, a single sample aggregate of this sort may be regarded as a
Bernoullian sequence.

Another type of sampling that is essentially Bernoullian is a spot-check
inspection on a production line. A factory is turning out gadgets
of some sort. All are produced under essentially the same conditions,
but occasionally a defective product will appear. Apparently
there is some fixed probability (characteristic of the factory) that a
given item will be defective. Furthermore, the fact that a given
product is defective has no discernible effect on the probability of
defects in [...]. Therefore, a single item taken off the [...] furnishes us
with a single "trial," and [...] Bernoullian sequence.

[...] might note is that a sequence of Bernoullian [...] be regarded as a
Bernoullian sequence. [...] the methods of this chapter (or by more [...])
there is a certain probability p [...] a given production line will contain
exactly [...] defective items. [...] "trial" is a [...] items. "Success" means
2 defective items out of the 100.

[...] is usually not strictly Bernoullian. However, many of
our later considerations concerning sequences of stochastic variables
(Chaps. 8 to 11) may be regarded as generalizations of the notion of
Bernoullian sequences. The student would do well to get a picture in
this simplest case of the many different types of situations that can be
analyzed as sequences of "trials."

The fundamental problem in connection with Bernoullian trials
concerns the probability of a given number of successes in a given (finite)
number of trials.


rnoullian trials con- 


Theorem I (Bernoulli's Formula). The probability of exactly r
successes in n Bernoullian trials is

$${}_nC_r\, p^r q^{n-r}.$$

Consider a particular sequence of n results, each either a success or a
failure:

    S S F S F F · · · F S.

The probability of each S is p, and that of each F is q. Remembering
that the trials are independent and using (c) of Theorem VII, Chap. 3,
we have that the probability of this particular sequence is

$$p\,p\,q\,p\,q\,q \cdots q\,p = p^r q^{n-r}.$$

Now, "r successes" means any one of the sequences containing r
successes and n − r failures. These different sequences represent
mutually exclusive events, each with probability $p^r q^{n-r}$. The number
of such sequences is clearly the number of ways of putting the S's in
their proper positions on the framework of n places. Since there are r of
these S's to be placed, the number of sequences is ${}_nC_r$. Therefore, by
the addition principle, Axiom C, the probability of r successes in n
trials is ${}_nC_r\, p^r q^{n-r}$.
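
The counting argument can be mirrored directly in code: list every sequence of S's and F's, keep those with exactly r successes, and add their probabilities. The Python sketch below is an illustration, not part of the original text; the function names and the choice p = 1/6 are assumptions made for the example, and `math.comb` plays the role of ${}_nC_r$.

```python
from fractions import Fraction
from itertools import product
from math import comb

def bernoulli(n, r, p):
    """Bernoulli's formula: probability of exactly r successes in n trials."""
    q = 1 - p
    return comb(n, r) * p ** r * q ** (n - r)

def by_enumeration(n, r, p):
    """Add p^r q^(n-r) over every S/F sequence with exactly r successes (Axiom C)."""
    q = 1 - p
    total = 0
    for seq in product("SF", repeat=n):
        if seq.count("S") == r:
            prob = 1
            for outcome in seq:          # independent trials: multiply
                prob *= p if outcome == "S" else q
            total += prob
    return total

p = Fraction(1, 6)   # hypothetical: "success" is a six on one throw of a die
print(bernoulli(5, 2, p), by_enumeration(5, 2, p))
```

Enumeration grows as $2^n$, so it is useful only as a check on small cases; the closed formula is what one would use in practice.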


From Theorem I and the addition principle, we at once derive the
following corollaries:

Corollary 1. The probability of at least r successes in n Bernoullian
trials is

$$\sum_{s=r}^{n} {}^{n}C_{s}\, p^{s} q^{n-s}.$$

Corollary 2. The probability of at most r successes in n Bernoullian
trials is

$$\sum_{s=0}^{r} {}^{n}C_{s}\, p^{s} q^{n-s}.$$
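Theorem I and its two corollaries can be tried out numerically. The
following Python sketch is only an illustration; the function names are
assumptions of this sketch, and exact fractions are used so that the
results can be compared with hand computation.

```python
from fractions import Fraction
from math import comb


def bernoulli_pmf(n, r, p):
    """Theorem I: probability of exactly r successes in n Bernoullian trials."""
    q = 1 - p
    return comb(n, r) * p**r * q ** (n - r)


def at_least(n, r, p):
    """Corollary 1: probability of at least r successes."""
    return sum(bernoulli_pmf(n, s, p) for s in range(r, n + 1))


def at_most(n, r, p):
    """Corollary 2: probability of at most r successes."""
    return sum(bernoulli_pmf(n, s, p) for s in range(0, r + 1))


if __name__ == "__main__":
    p = Fraction(1, 6)  # probability of an ace on one throw of a die
    print(bernoulli_pmf(5, 3, p), at_least(5, 3, p), at_most(5, 2, p))
```

Passing p as a `Fraction` keeps every result exact; passing a float
works as well but gives rounded values.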


30. Bernoulli Formula—Stochastic Variable Version 


In order to adapt much of the discussion in later chapters to the
important special case of Bernoullian sequences of trials, we must
formulate a stochastic variable description of such a sequence of
experiments. If we perform a set of n experiments, the record of all n
results describes a compound event with n components, each component
the result of one experiment. Thus, we describe the whole set of
experiments by an n-dimensional event space with each of the n
coordinates representing one experiment. Let us call these coordinate
variables $x_i$ (i = 1, 2, . . . , n). Note that we are using $x_i$ to
denote a variable, not a specific value of some variable x. The
convenience of this notation will soon be obvious, and the student must
overcome the habit formed in analytic geometry of using subscripts with
x to denote constants.


This n-dimensional representation could describe any set of n
experiments, but in the Bernoullian case we can be much more specific.
According to (a) of Definition 1, each variable $x_i$ assumes only 2
values, one representing failure, the other success. For each variable,
let these values be 0 and 1, respectively. Then the event space consists
of the $2^n$ corner points of the unit “cube” in n space.

Now, (b) of Definition 1 tells us that each variable $x_i$ should have a
probability function $f_i(x_i)$ whose form is the same for all i’s;
namely,

$$\begin{array}{c|cc} x_i & 0 & 1 \\ \hline f_i(x_i) & q & p \end{array}$$

The independence criterion tells us that the stochastic variables $x_i$
have the joint probability function

$$\varphi(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_i(x_i).$$
To see the mathematical representation of the number of successes in n
trials, all we have to do is use the right words to describe what we
have already said about the stochastic variables $x_i$. Suppose we put
it this way: The value of $x_i$ is the number of successes on the ith
trial. This is exactly what we said before; and when we state it this
way, it is clear that the number of successes in a given sequence of
trials is the sum of the values assumed by the x’s for that particular
sequence. In symbols,

$$r = \sum_{i=1}^{n} x_i.$$


Translated into these terms, Theorem I reads as follows:

Theorem II. If $x_1, x_2, \ldots, x_n$ are independent stochastic
variables each of which assumes the values 0 and 1 with probability q
and p, respectively, then

$$r = \sum_{i=1}^{n} x_i$$

is a stochastic variable with probability function

$$f(r) = {}^{n}C_{r}\, p^{r} q^{n-r}.$$

This is merely a restatement of Theorem I and therefore follows
from it. However, a direct proof of Theorem II in terms of probability
functions might help to clarify some of the concepts introduced here.

We have already noted that

$$\varphi(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_i(x_i);$$

therefore if we have a point in n space for which $\sum x_i = r$ (i.e.,
one whose coordinates consist of r ones and n − r zeros), the value of
$\varphi$ at that point is the product of r factors p and n − r factors
q. In other words, $\varphi = p^{r} q^{n-r}$ at each such point. Thus,
$\operatorname{pr}\{\sum x_i = r\}$ is $p^{r} q^{n-r}$ times the number
of corner points of the unit cube for which $\sum x_i = r$.
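The direct argument can be imitated by brute force for small n: visit
every corner point of the unit cube, multiply the n factors p or q, and
accumulate the products according to the coordinate sum. The sketch
below does just that; the function name is an assumption of this sketch,
and p should be given as a `Fraction` if exact results are wanted.

```python
from fractions import Fraction
from itertools import product
from math import comb


def distribution_of_sum(n, p):
    """Return {r: pr{x_1 + ... + x_n = r}} by visiting all 2**n corner points."""
    q = 1 - p
    dist = {r: Fraction(0) for r in range(n + 1)}
    for corner in product((0, 1), repeat=n):
        weight = Fraction(1)
        for x in corner:
            weight *= p if x == 1 else q  # f_i(1) = p, f_i(0) = q
        dist[sum(corner)] += weight  # corner lies on the hyperplane sum = r
    return dist


if __name__ == "__main__":
    n, p = 4, Fraction(1, 3)
    for r, value in distribution_of_sum(n, p).items():
        # second column: Bernoulli's formula for comparison
        print(r, value, comb(n, r) * p**r * (1 - p) ** (n - r))
```

The enumeration costs $2^n$ steps, so it is a check for small n only;
Theorem II is what makes large n tractable.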
Perhaps the easiest way to count these corner points is to repeat the
argument used in proving Theorem I, noting that the corner points on
the unit cube are given by sequences of zeros and 1’s. However, the
following geometric picture might prove instructive. Let us draw
some skeleton “cubes.”

[Fig. 11. Skeleton cubes in 2 dimensions and 3 dimensions.]

The hyperplanes $\sum x_i$ = constant stand out much more clearly if we
collapse these cubes like this:

[Figure: the collapsed cubes; surviving panel label: 2 dimensions.]

In each of these pictures it is clear that the number of corner points
for each value of r is given by the formula ${}^{n}C_{r}$. Furthermore,
if we study these pictures a little more closely, we note that each one
is made by taking the preceding one and drawing a line through each
corner point. So, each corner point on the n cube blossoms into two
on the (n + 1) cube, one shifted to the left, the other to the right.
Thus the corner points for a given value of r on the (n + 1) cube come
from those on the n cube for both r and r − 1. A comparison of this
with the formula ${}^{n+1}C_{r} = {}^{n}C_{r} + {}^{n}C_{r-1}$
(Prob. 18, Chap. 1) shows that the formula

$$\operatorname{pr}\Bigl\{\sum x_i = r\Bigr\} = {}^{n}C_{r}\, p^{r} q^{n-r}$$

must hold in general.


31. Example—Random Walk Problem

Suppose a point starts from the origin and moves along the x axis in
jumps of 1 unit each. Each jump may be either forward or backward,
and we suppose that at each step the probability for each direction
is 1/2. Furthermore, we shall assume that each jump is independent of
all the others. After n jumps the point might be at any one of a number
of points ranging from −n to n, and we want to find the probability
of its being at each of the possible points in this range.

Let $x_i$ (i = 1, 2, . . . , n) be the displacement on the ith jump.
Then the $x_i$’s are independent stochastic variables each having the
probability function

$$\begin{array}{c|cc} x_i & -1 & 1 \\ \hline f(x_i) & 1/2 & 1/2 \end{array}$$

Now the net displacement (i.e., the abscissa of the point) after n
jumps is the sum of these n individual displacements. So, letting x be
the abscissa of the point after n jumps, we have

$$x = \sum_{i=1}^{n} x_i.$$

These variables do not fit Theorem II, but the variables

$$z_i = \frac{x_i + 1}{2}$$

do fit. Therefore, from Theorem II, we have

$$\operatorname{pr}\Bigl\{\sum z_i = r\Bigr\} = {}^{n}C_{r} \Bigl(\frac12\Bigr)^{r} \Bigl(\frac12\Bigr)^{n-r} = \frac{{}^{n}C_{r}}{2^{n}}.$$

To find out about x, we note that

$$\sum_{i=1}^{n} z_i = \sum_{i=1}^{n} \frac{x_i + 1}{2} = \frac{x + n}{2}.$$

Thus, x = a means $\sum z_i = (a + n)/2$, whence

$$\operatorname{pr}\{x = a\} = \frac{{}^{n}C_{(a+n)/2}}{2^{n}}.$$

That is, the probability function for x is

$$f(x) = \frac{{}^{n}C_{(x+n)/2}}{2^{n}}.$$
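A small Python sketch can be used to compare this probability function
with a direct enumeration of all $2^n$ jump sequences. The function
names are assumptions of this sketch. Note that (x + n)/2 is an integer
only when x and n have the same parity; the sketch returns zero for the
other values of x, which the point cannot reach.

```python
from fractions import Fraction
from itertools import product
from math import comb


def walk_probability(n, a):
    """pr{x = a} after n independent jumps of +1 or -1, each with probability 1/2."""
    if abs(a) > n or (a + n) % 2 != 0:
        return Fraction(0)  # unreachable abscissa
    return Fraction(comb(n, (a + n) // 2), 2**n)


def walk_by_enumeration(n):
    """Same distribution, found by listing every sequence of jumps."""
    counts = {}
    for jumps in product((-1, 1), repeat=n):
        a = sum(jumps)
        counts[a] = counts.get(a, 0) + 1
    return {a: Fraction(c, 2**n) for a, c in counts.items()}


if __name__ == "__main__":
    print([walk_probability(4, a) for a in range(-4, 5)])
```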
In the two-dimensional random walk problem, we assume that at
each jump the point may move 1 unit forward, backward, up, or down,
each with probability 1/4. Our first inclination might be to let $x_i$
be the ith horizontal displacement and $y_i$ the ith vertical
displacement. The trouble with this is that, for each i, $x_i$ and $y_i$
fail to be independent because we have assumed that, if there is a
horizontal displacement on a given jump, there is no vertical
displacement on that jump, and vice versa.

Instead, let us tilt the x and y axes at an angle of 45° with the
directions of the jumps. Then, let $x_i$ and $y_i$ be stochastic
variables with a joint probability function $\varphi_i(x_i, y_i)$
described as follows:

$$\begin{array}{c|cc|c} & x_i = -1 & x_i = 1 & g_i(y_i) \\ \hline y_i = 1 & 1/4 & 1/4 & 1/2 \\ y_i = -1 & 1/4 & 1/4 & 1/2 \\ \hline f_i(x_i) & 1/2 & 1/2 & \end{array}$$

Clearly, $\varphi_i = f_i \cdot g_i$ at each point; so the variables are
independent. Furthermore, they describe a random displacement of length
$\sqrt{2}$ in a direction tilted at 45° to the axes. We could change
scale to make the jumps of length 1, but it is simpler to leave them the
way they are.

This construction gives us 2n stochastic variables
$x_1, \ldots, x_n, y_1, \ldots, y_n$, each independent of all the
others. Therefore, the two sums

$$x = \sum_{i=1}^{n} x_i \quad\text{and}\quad y = \sum_{i=1}^{n} y_i$$

are independent of each other. This means that the joint probability
function $\varphi(x, y)$ is the product of the two marginal probability
functions f(x) and g(y). These last two are exactly what we found in
the one-dimensional problem; so

$$\varphi(x, y) = f(x)\,g(y) = \frac{{}^{n}C_{(x+n)/2}}{2^{n}} \cdot \frac{{}^{n}C_{(y+n)/2}}{2^{n}} = \frac{{}^{n}C_{(x+n)/2}\; {}^{n}C_{(y+n)/2}}{4^{n}}$$

gives a joint probability function for the abscissa and ordinate of the
point after n jumps.
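The product formula can be compared with a direct enumeration of the
original walk. In the sketch below (an illustration under stated
assumptions, not a definitive treatment) the untilted position is
(X, Y), and the tilted coordinates are taken to be x = X + Y and
y = X − Y, so that each jump changes both x and y by ±1; this choice of
orientation and all function names are assumptions of the sketch.

```python
from fractions import Fraction
from itertools import product
from math import comb


def marginal(n, a):
    """One-dimensional result: pr{sum of n jumps of +1 or -1 equals a}."""
    if abs(a) > n or (a + n) % 2 != 0:
        return Fraction(0)
    return Fraction(comb(n, (a + n) // 2), 2**n)


def joint_probability(n, x, y):
    """Joint probability function in the tilted coordinates."""
    return marginal(n, x) * marginal(n, y)


def joint_by_enumeration(n):
    """Enumerate all 4**n walks made of unit jumps forward, backward, up, down."""
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    counts = {}
    for path in product(steps, repeat=n):
        big_x = sum(s[0] for s in path)
        big_y = sum(s[1] for s in path)
        key = (big_x + big_y, big_x - big_y)  # tilted coordinates
        counts[key] = counts.get(key, 0) + 1
    return {key: Fraction(c, 4**n) for key, c in counts.items()}


if __name__ == "__main__":
    print(joint_probability(2, 0, 0), joint_by_enumeration(2)[(0, 0)])
```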


32. The General Addition Formula

The second problem we want to consider in this chapter is roughly
this: What modifications must be made in the addition principle if the
events are not mutually exclusive? The answer to this question looks
rather complicated unless we introduce some special notation. Let
$E_1, E_2, \ldots, E_n$ be any set of n events. Then, let

$$S_1 = \sum_{i} \operatorname{pr}\{E_i\},$$

$$S_2 = \sum_{i,j} \operatorname{pr}\{E_i E_j\},$$

$$S_3 = \sum_{i,j,k} \operatorname{pr}\{E_i E_j E_k\},$$

$$\cdots$$

$$S_n = \operatorname{pr}\{E_1 E_2 \cdots E_n\},$$

where the summations over several indices are taken to mean the sum
of all terms obtained by taking combinations (not permutations) of
indices. For instance, if there are three events, $S_2$ is

$$\operatorname{pr}\{E_1E_2\} + \operatorname{pr}\{E_1E_3\} + \operatorname{pr}\{E_2E_3\},$$

not

$$\operatorname{pr}\{E_1E_2\} + \operatorname{pr}\{E_1E_3\} + \operatorname{pr}\{E_2E_1\} + \operatorname{pr}\{E_2E_3\} + \operatorname{pr}\{E_3E_1\} + \operatorname{pr}\{E_3E_2\}.$$

Theorem III. Given any set of n events, $E_1, E_2, \ldots, E_n$,

$$\operatorname{pr}\Bigl\{\sum E_i\Bigr\} = S_1 - S_2 + S_3 - \cdots \pm S_n = \sum_{k=1}^{n} (-1)^{k-1} S_k.$$


To prove this formula, let us think of the events $E_i$ as point sets in
the event space. The event $\sum E_i$ is the set of points belonging to
one or more of the individual sets $E_i$, and
$\operatorname{pr}\{\sum E_i\}$ is the integral of a joint density
function over this set. Each of the probabilities appearing in the sums
$S_k$ is the integral of this same density function over some set.
However, the sets over which we integrate overlap. Furthermore,
sometimes we add the integral, and sometimes we subtract. Let us take a
collection $E_{i_1}, E_{i_2}, \ldots, E_{i_r}$ of the E’s and designate
by R the set of all points of $\sum E_i$ that belong to each of these
sets and to no others. Note that in general R is not the product of
these sets. A point in the product might belong to other E’s too. Now,
each point in $\sum E_i$ belongs to one and only one of these sets R.
Therefore, we shall prove the theorem if we show that, for an arbitrary
set R of this type, the right-hand side of the formula gives a net
result of exactly +1 so far as the integral over R is concerned.

Suppose, then, that R is the set of points belonging to each of the
sets $E_{i_1}, E_{i_2}, \ldots, E_{i_r}$ and to no others. A product set
$\prod E_j$ will contain the whole of R if all the subscripts over which
we multiply are taken from among the indices $i_1, i_2, \ldots, i_r$.
Otherwise, the product set will contain none of R. So the integral over
R appears in the sum $S_k$ only if k ≤ r, and then it appears
${}^{r}C_{k}$ times, once for every combination of k subscripts taken
from the r indices associated with R. Therefore, the net count on the
number of times we integrate over R is

$$\sum_{k=1}^{r} (-1)^{k-1}\, {}^{r}C_{k}.$$

Now (Prob. 20, Chap. 1),

$$0 = \sum_{k=0}^{r} (-1)^{k}\, {}^{r}C_{k} = {}^{r}C_{0} + \sum_{k=1}^{r} (-1)^{k}\, {}^{r}C_{k} = 1 - \sum_{k=1}^{r} (-1)^{k-1}\, {}^{r}C_{k},$$

whence

$$\sum_{k=1}^{r} (-1)^{k-1}\, {}^{r}C_{k} = 1.$$
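For a finite event space of equally likely points, Theorem III can be
checked mechanically: represent each event as a set of points, form the
sums $S_k$ over combinations of k events, and compare the alternating
sum with the probability of the union counted directly. The sketch below
assumes such an equally likely space; the function name and the sample
events are illustrative only.

```python
from fractions import Fraction
from itertools import combinations


def union_probability(events, space_size):
    """pr{E_1 + ... + E_n} = S_1 - S_2 + S_3 - ... for equally likely points."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        # S_k: sum over combinations (not permutations) of k events
        s_k = sum(
            Fraction(len(set.intersection(*combo)), space_size)
            for combo in combinations(events, k)
        )
        total += (-1) ** (k - 1) * s_k
    return total


if __name__ == "__main__":
    events = [{0, 1, 2}, {2, 3}, {0, 3, 4, 5}]  # overlapping events, 8 points in all
    direct = Fraction(len(set.union(*events)), 8)
    print(union_probability(events, 8), direct)
```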


[. . .] applications which illustrate [. . .]

If n numbered balls are placed in [. . .] what is the probability that
no [. . .] right pocket. Then, the required probability is given by

$$1 - \operatorname{pr}\Bigl\{\sum E_i\Bigr\}.$$

The second term we compute by the formula. There are n pockets for
the ith ball to go into, one of which is “right”; therefore (see
Prob. 19, Chap. 3)

$$\operatorname{pr}\{E_i\} = \frac{1}{n}, \qquad S_1 = \sum_{i} \operatorname{pr}\{E_i\} = 1.$$

If the ith ball is in the right pocket, there are n − 1 pockets for the
jth to choose from; so

$$\operatorname{pr}_{E_i}\{E_j\} = \frac{1}{n-1}.$$

Then, Theorem IV, Chap. 3, gives us that

$$\operatorname{pr}\{E_i E_j\} = \frac{1}{n} \cdot \frac{1}{n-1}.$$

Hence

$$S_2 = \sum_{i,j} \operatorname{pr}\{E_i E_j\} = \frac{{}^{n}C_{2}}{n(n-1)} = \frac{1}{2!}.$$

Similarly,

$$S_3 = \sum_{i,j,k} \operatorname{pr}\{E_i E_j E_k\} = \frac{{}^{n}C_{3}}{n(n-1)(n-2)} = \frac{1}{3!},$$

etc. Therefore, the required probability is

$$1 - \operatorname{pr}\Bigl\{\sum E_i\Bigr\} = 1 - \sum_{k=1}^{n} \frac{(-1)^{k-1}}{k!} = \sum_{k=0}^{n} \frac{(-1)^{k}}{k!}.$$

There are two interesting things about this probability of no
correspondences. First, it is greater for even n than for odd n.
Second, as n → ∞, it tends to

$$\sum_{k=0}^{\infty} \frac{(-1)^{k}}{k!} = e^{-1} \approx 0.368.$$
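For small n the formula for no correspondences can be compared with a
count over all n! arrangements of the balls. The following sketch is
illustrative; an arrangement is modeled as a permutation of 0, . . . ,
n − 1, with ball i in the right pocket when position i holds i, and the
function names are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def no_match_probability(n):
    """Sum over k = 0..n of (-1)^k / k!."""
    return sum(Fraction((-1) ** k, factorial(k)) for k in range(n + 1))


def no_match_by_enumeration(n):
    """Fraction of the n! arrangements in which no ball is in its own pocket."""
    good = sum(
        1
        for perm in permutations(range(n))
        if all(perm[i] != i for i in range(n))
    )
    return Fraction(good, factorial(n))


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, no_match_probability(n), float(no_match_probability(n)))
```

Running the loop shows the values alternating above and below their
limit, which is the behavior described in the text for even and odd n.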


Using the above formula for the probability of no correspondences,
we can easily find the probability of any given number of them. If a
certain set of r balls is specified, the probability that each of them
is in the right pocket is

$$\frac{1}{n(n-1)\cdots(n-r+1)}.$$

The probability that none of the others is in the right pocket is given
by our original result with n replaced by n − r:

$$\sum_{k=0}^{n-r} \frac{(-1)^{k}}{k!}.$$

There are ${}^{n}C_{r}$ different sets of r balls, giving us
${}^{n}C_{r}$ mutually exclusive events of the form, “These r balls and
no others are in the right pockets.” The event “exactly r
correspondences” is the sum of these mutually exclusive events;
therefore its probability is

$$\frac{{}^{n}C_{r}}{n(n-1)\cdots(n-r+1)} \sum_{k=0}^{n-r} \frac{(-1)^{k}}{k!} = \frac{1}{r!} \sum_{k=0}^{n-r} \frac{(-1)^{k}}{k!}.$$
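The same kind of enumeration checks the formula for exactly r
correspondences. In this sketch (function names are assumptions) the
number of correspondences of an arrangement is the number of positions
i holding ball i.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def exactly_r_matches(n, r):
    """(1/r!) times the sum over k = 0..n-r of (-1)^k / k!."""
    tail = sum(Fraction((-1) ** k, factorial(k)) for k in range(n - r + 1))
    return tail / factorial(r)


def match_distribution_by_enumeration(n):
    """List whose r-th entry is the fraction of arrangements with r matches."""
    counts = [0] * (n + 1)
    for perm in permutations(range(n)):
        counts[sum(1 for i in range(n) if perm[i] == i)] += 1
    return [Fraction(c, factorial(n)) for c in counts]


if __name__ == "__main__":
    print([exactly_r_matches(5, r) for r in range(6)])
```

The formula gives zero for r = n − 1, as it must: if all balls but one
are in the right pockets, the last one is too.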


REFERENCES FOR FURTHER STUDY 
Levy and Roth, Elements of Probability, Chap. V. 
Uspensky, Introduction to Mathematical Probability, Chaps. II, III.


PROBLEMS 


1. What is the probability of getting exactly 3 aces in 5 throws of a
single die? at least 3 aces? Ans. 250/7,776; 276/7,776.

2. What is the probability of getting at least 1 ace in 5 throws of
a single die? in 4 throws? Ans. 4,651/7,776; 671/1,296.

3. What is the probability of getting exactly 3 aces in a single throw
of 5 dice? Ans. 250/7,776.

4. A coin is tossed and the cumulative heads-tails score is recorded.
What is the probability that the heads total will reach 6 before the
tails total reaches 4? Ans. 65/256.

5. Generalize Prob. 4. What is the probability of getting m heads
before n tails?

$$\text{Ans. } \sum_{r=m}^{m+n-1} \frac{{}^{m+n-1}C_{r}}{2^{m+n-1}}.$$

6. Generalize Prob. 5. In a general sequence of Bernoullian trials
(probability p of success), what is the probability of m successes
before n failures?

$$\text{Ans. } \sum_{r=m}^{m+n-1} {}^{m+n-1}C_{r}\, p^{r} q^{m+n-1-r}.$$
7. Player A has 50 cents, and player B has $1.10. They want to play a
game in which the winner will take the entire amount. Devise an
equitable game that can be played by tossing a coin. Suggestion: They
toss the coin 5 times, and A bets there will be exactly 2 heads. Find at
least two other simple solutions.

8. Devise an equitable coin-tossing game in which A’s stake is 70
cents and B’s stake is 90 cents.

9. From a supply of black and white balls 4 balls are placed in an
urn, the color of each being determined by tossing a coin. A ball is
drawn from the urn and replaced. This is done 4 times, and all 4 balls
drawn are white. What is the probability that the urn contains only
white balls? Ans. [. . .]

10. The Buffon needle experiment (see Sec. 27) is performed 5 times
with the needle length one-half the distance between the lines. What
is the probability that it will touch a line on 2 out of the 5 tries?

$$\text{Ans. } 10 \Bigl(\frac{1}{\pi}\Bigr)^{2} \times \Bigl(\frac{\pi - 1}{\pi}\Bigr)^{3}.$$

11. What is the probability of throwing at least 8 exactly 4 times in
7 throws of 2 dice? Ans. $35(5/12)^4 \times (7/12)^3$.

12. What is the probability of throwing exactly 8 at least 4 times in
7 throws of 2 dice?
Ans. $35(5/36)^4(31/36)^3 + 21(5/36)^5(31/36)^2 + 7(5/36)^6(31/36) + (5/36)^7$.

13. There are 4 balls and 4 urns. Each ball is placed in an urn, with
the placement of each ball independent of the placement of the others.
An urn is chosen at random; what is the probability that it will contain
2 balls? Ans. 54/256.

14. The experiment in Prob. 13 is performed 4 times. What is the
[. . .] Ans. $4 \times 54/256 \times (202/256)^3$.

15. A number is chosen at random between 0 and 1. What is the
probability that there is exactly one 7 in the first 10 decimal places?
Ans. $9^9/10^9$.

16. A number is chosen at random between 0 and 1. What is the
probability that exactly 5 of its first 10 decimal places are less
than 5? Ans. 63/256.


17. In a sequence of tosses of a coin a man [. . .] $1 [. . .]. Let x
be his net gain after n tosses. [. . .]

18. A maze consists of equally spaced north-south paths and equally
spaced east-west paths, intersecting to form a checkerboard pattern. A
rat is started from some intersection and wanders through the maze in
a completely aimless fashion, stopping at each intersection and choosing
one of the 4 paths at random. Find the probability that the 2nth
intersection he comes to will be the one from which he started.
Ans. $({}^{2n}C_{n})^2/16^n$.

19. Take a Bernoullian sequence of trials with p = 1/3, q = 2/3. Find
the probability function for the number of successes in 7 trials,
8 trials, 9 trials, 10 trials.

20. What is the most probable number of successes in each of the
four cases in Prob. 19? Ans. 2, 2 or 3 (equally likely), 3, 3.

21. Show that the probability of r successes in n Bernoullian trials
divided by the probability of r + 1 successes is (r + 1)q/[(n − r)p].

22. Under what circumstances are there successive values of r that
are equally likely? Hint: The ratio in Prob. 21 must equal unity.
Ans. When (n + 1)p is an integer.

23. Show that the most probable value of r is the greatest integer
less than or equal to (n + 1)p. Hint: The ratio in Prob. 21 increases
as r increases. The most probable value of r is the first value for
which this ratio is greater than unity.

24. When is 0 the most probable value of r? When is n the most
probable value? Ans. When p < 1/(n + 1); when p > n/(n + 1).

25. Show that the probability of at least r + 1 successes in n
Bernoullian trials is

$$(r + 1)\, {}^{n}C_{r+1} \int_{0}^{p} x^{r} (1 - x)^{n-r-1}\, dx.$$

Hint: See Prob. 26, Chap. 1.

26. Solve Prob. 7, Chap. 2, using Theorem III of this chapter.

27. What is the probability that at least one of the players in a
bridge game will be dealt a complete suit of cards?
Ans. $[16 \times 13! \times 39! - 72 \times (13!)^2 \times 26! + 72 \times (13!)^4]/52!$.

28. What is the probability that the rat in Prob. 18 will return to his
starting point at least once in the first 6 legs of his journey?
Ans. [. . .]

29. An urn contains 5 white balls and 5 black balls. Ten times a
ball is drawn and replaced. What is the probability that each of the
white balls is drawn at least once?
Ans. $1 - 5(9/10)^{10} + 10(8/10)^{10} - 10(7/10)^{10} + 5(6/10)^{10} - (5/10)^{10}$.

30. From a student body of 5,000, a student opinion poll takes a
random sample of 200 student opinions. They do this by merely
stopping a student on the campus, regardless of whether he has been
questioned before or not. What is the probability that each of the 35
members of a certain fraternity will be questioned at least once?

$$\text{Ans. } \sum_{r=0}^{35} (-1)^{r}\, {}^{35}C_{r}\, \frac{(5{,}000 - r)^{200}}{5{,}000^{200}}.$$

CHAPTER 5 
MORE ABOUT STOCHASTIC VARIABLES 


Throughout most of Chaps. 2 and 3 we have looked at the stochastic
variable and its density function as the framework for the mathematical
model of a physical situation. For all practical purposes what we
have done is to jump from the physical situation to the stochastic
variable and its density function and from there to the event space and
the probabilities of events therein. However, this is for practical
purposes only. A quick look at Definition 2, Chap. 2, will show that,
while this point of view may be practical, it is fundamentally backward.
The event space and its distribution function are the foundation
elements of mathematical probability, and stochastic variables are
functions defined over this space. The principal thesis of this chapter
is just that. A stochastic variable is a function. The operations
performed on stochastic variables in succeeding chapters will be much
easier to understand if the student keeps this in mind.


34. Functions of Stochastic Variables 


There is no reason why we should scrap the work we have done so
far, however. Let us look at it this way: Certain stochastic variables
(the coordinate variables) and their density functions serve to describe
the event space. Thus, functions defined over the event space could be
regarded as functions of these coordinate variables. That is, if a
physical situation is represented by a variable x with density function
f(x), then in general new stochastic variables will be functions u(x) of
the coordinate variable. In particular, u(x) = x describes the
coordinate variable as a function over the event space.

We are used to seeing a density function (or probability function) with
each stochastic variable; so the question might arise, Do stochastic
variables in general have such functions? The answer to this question
is, “Not always.” However, if [. . .] ordinary integrals of [. . .] them
as Stieltjes integrals of a distribution function, then the same form
could be applied to all [. . .] we can [. . .] of this fact, let us
discuss the special cases in which the forms we have adopted can be
carried over. The discrete case can be handled in general.

Theorem I. Let S be an event space, discrete or continuous, with
any number of dimensions. Let u be a stochastic variable defined over
S. If the set of all possible values for u is a discrete set, then a
probability function g(u) is given by

$$g(a) = \operatorname{pr}\{E_a\}$$

where $E_a$ is the set of points in S for which u = a.

This is a consequence of (d) of Theorem II, Chap. 2.

In the stochastic variable description of Bernoullian trials (Chap. 4)
we have already seen an example of a discrete variable defined over a
multidimensional space. There u was the sum of the coordinate variables,
and the sets $E_a$ referred to in Theorem I were the sets of corner
points on the unit cubes contained in the hyperplanes $\sum x_i = a$.

It is worth noting that in Theorem I the only thing that has to be
discrete is the set of values of u. In a later section of this chapter
we shall look at some discrete variables defined over continuous spaces.


Theorem II. Let x be a stochastic variable with a density function
f(x), and let u = u(x) be a strictly increasing (or strictly decreasing)
function of x. Then, u is a stochastic variable with a density function

$$g(u) = f[x(u)]\, |x'(u)|$$

where x(u) is the inverse function to u(x).

We want a function g(u) such that

$$\operatorname{pr}\{c < u \le d\} = \int_{c}^{d} g(u)\, du$$

for every c < d. Let a = x(c) and b = x(d). Then, in the strictly
increasing case, c < u ≤ d is equivalent to a < x ≤ b, and

$$\operatorname{pr}\{c < u \le d\} = \operatorname{pr}\{a < x \le b\} = \int_{a}^{b} f(x)\, dx = \int_{c}^{d} f[x(u)]\, x'(u)\, du$$

by the familiar rules for change of variables in a definite integral. In
this case x'(u) > 0 everywhere; so |x'(u)| = x'(u), and the theorem is
proved for the case of increasing u.

In the strictly decreasing case, |x'(u)| = −x'(u), so that a minus
sign is introduced. This is as it should be, however, because in this
case c < u ≤ d is equivalent to b ≤ x < a; so one set of limits of
integration must be interchanged.

This is as near as we can come (on the first-year calculus level) to
reducing the continuous case problem to a formula. Unfortunately,
Theorem II is too restricted to be of much use. The following
suggestions will cover many important cases.


Theorem III (Multidimensional Case). If a stochastic variable u is
defined over an event space of several dimensions, there are two
suggested procedures for finding a density function for u:

(a) Find the set in the event space for which t < u ≤ [. . .].
Estimate the probability of this set [. . .]

(b) Find the set in the event space for which u ≤ t. Integrate the
joint density function over this set to find its probability, and apply
(c') and (d') of Theorem [. . .]

Theorem IV (Multivalued Case). If a stochastic variable u fits
Theorem II except that it is strictly increasing over some portions of
the x axis and strictly decreasing over others, apply Theorem II to
each branch of the inverse function. Then, for each value of u, add
the results to get g(u).


Suppose x and y are stochastic variables and u is a function of x
alone. Let us suppose u is strictly increasing; other cases follow at
once by the application of Theorem IV. Then, the set in which
α < u ≤ β is the vertical strip in the xy plane defined by the
inequalities a < x ≤ b where α = u(a) and β = u(b). So, if x and y are
independent, we have

$$\operatorname{pr}\{\alpha < u \le \beta \text{ and } c < y \le d\} = \operatorname{pr}\{a < x \le b \text{ and } c < y \le d\}$$

$$= \operatorname{pr}\{a < x \le b\}\, \operatorname{pr}\{c < y \le d\} = \operatorname{pr}\{\alpha < u \le \beta\}\, \operatorname{pr}\{c < y \le d\}.$$

That is [see (c) of Definition 2, Chap. 3], u and y are independent.
Now, if v is a function of y alone, we apply the result just obtained to
see that u and v are independent. Thus, we have proved the following
theorem.

Theorem V. If x and y are independent stochastic variables and if
u = u(x) and v = v(y), then u and v are independent stochastic
variables.


The examples in a later section of this chapter on normally distributed
variables will serve to illustrate Theorem III. By way of
illustration of Theorem IV, we might return to the problem of the
time of disintegration of the radium atom (see Sec. 14).

In this problem we have a stochastic variable x with a density
function

$$f(x) = \begin{cases} k e^{-kx} & \text{for } x \ge 0, \\ 0 & \text{for } x < 0. \end{cases}$$

In Chap. 6 we shall see that an important stochastic variable in
connection with this problem is

$$u = \Bigl(x - \frac{1}{k}\Bigr)^{2};$$

the inverse function is

$$x = \frac{1}{k} \pm \sqrt{u}$$

with the provision that x ≥ 0. The picture:

[Fig. 13.]

The inverse is double-valued for 0 < u ≤ 1/k² and single-valued for
u > 1/k². Now

$$x'(u) = \pm \frac{1}{2\sqrt{u}};$$

hence on each branch of the double-valued section

$$|x'(u)| = \frac{1}{2\sqrt{u}}.$$

On the top branch

$$f[x(u)] = k e^{-k\sqrt{u} - 1}.$$

On the bottom branch

$$f[x(u)] = k e^{k\sqrt{u} - 1}.$$

Therefore,

$$g(u) = \begin{cases} 0 & \text{for } u \le 0, \\[4pt] \dfrac{k}{2\sqrt{u}} \bigl(e^{k\sqrt{u} - 1} + e^{-k\sqrt{u} - 1}\bigr) = \dfrac{k}{e\sqrt{u}} \cosh\bigl(k\sqrt{u}\bigr) & \text{for } 0 < u \le \dfrac{1}{k^{2}}, \\[8pt] \dfrac{k}{2\sqrt{u}}\, e^{-k\sqrt{u} - 1} & \text{for } u > \dfrac{1}{k^{2}}. \end{cases}$$

35. Translation and Change of Scale 


In many stochastic variable problems it is convenient to translate
the axes; that is, if we have a stochastic variable x, to introduce a
new stochastic variable z = x − a. Now, this new variable is merely a
function of the old one; so if we have the density (or probability)
function for x, we should be able to use Theorem II (or I) to get the
density (or probability) function for z.

For the continuous case, x(z) = z + a; so x'(z) = 1. Thus, if f(x)
is the density function for x, the density function for z is
g(z) = f(z + a). The same result holds in the discrete case. z = α
means x = α + a; therefore, by Theorem I,
g(α) = pr{x = α + a} = f(α + a). That is, g(z) = f(z + a). Let us state
this result formally.

Theorem VI. If x is a stochastic variable with a density (or
probability) function f(x) and if z = x − a, then z is a stochastic
variable with a density (or probability) function g(z) = f(z + a).

A transformation of the form z = kx amounts to a change of scale on
the x axis. This, too, is a useful operation on a stochastic variable,
and we should do well to find out what it does to density and
probability functions. Here, for the first time, we see the essential
difference between density functions and probability functions [. . .].
The crucial point is that the density function is multiplied by dx
before being added (integrated), while the probability function is not.
This distinction shows up very clearly in the two different formulas for
the new function after a change of scale.

Theorem VII (Discrete Case). If x is a stochastic variable with a
probability function f(x) and if z = kx, then z is a stochastic variable
with a probability function g(z) = f(z/k).

This follows immediately from Theorem I. z = α means x = α/k,
and Theorem I gives us the desired result.

Theorem VIII (Continuous Case). If x is a stochastic variable with
a density function f(x) and if z = kx, then z is a stochastic variable
with a density function g(z) = (1/k)f(z/k).

The inverse function is x(z) = z/k; so x'(z) = 1/k, and (taking k > 0,
as a change of scale requires) this result follows from Theorem II.
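The difference between Theorems VII and VIII shows up at once in a
numerical experiment: a rescaled probability function still sums to 1
without any extra factor, whereas a rescaled density integrates to 1
only if the factor 1/k is included. The sketch below assumes k > 0, uses
a fair die and the density $e^{-x}$ (x ≥ 0) as examples, and integrates
by a simple midpoint rule; all names are illustrative.

```python
import math
from fractions import Fraction


def scale_probability_function(f, k):
    """Theorem VII: g(z) = f(z/k); f is a dict {value: probability}."""
    return {k * x: pr for x, pr in f.items()}


def scale_density(f, k):
    """Theorem VIII (k > 0): g(z) = (1/k) f(z/k)."""
    return lambda z: f(z / k) / k


def translate(f, a):
    """Theorem VI: g(z) = f(z + a) for a density f."""
    return lambda z: f(z + a)


def integrate(fn, low, high, steps=100_000):
    """Midpoint-rule approximation to the integral of fn from low to high."""
    width = (high - low) / steps
    return sum(fn(low + (i + 0.5) * width) for i in range(steps)) * width


def exponential_density(x):
    return math.exp(-x) if x >= 0 else 0.0


if __name__ == "__main__":
    die = {face: Fraction(1, 6) for face in range(1, 7)}
    print(sum(scale_probability_function(die, 10).values()))
    print(integrate(scale_density(exponential_density, 3.0), 0.0, 90.0))
    # Omitting the factor 1/k gives total "probability" k instead of 1:
    print(integrate(lambda z: exponential_density(z / 3.0), 0.0, 90.0))
```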


36. Representation of Physical Situations 
This section consists of a brief digression in the form of some
examples of a procedure which is not very practical but which might help
to illustrate and emphasize the fact that fundamentally a stochastic
variable is a function. The method of Chaps. 2 and 3 is by far the
most common in the representation of physical situations. There we
let values of the coordinate variable (or variables) represent the
physical events in a natural way and then chose a density function (or
probability function) that would properly describe the probabilities.
Now, suppose we go at it from the point of view of the axioms and the
definition of a stochastic variable (Definition 2, Chap. 2). That is,
let us just pull out of thin air an event space with a distribution
function satisfying the axioms. This space may seem to have no
particular connection with any physical situation, but let us try to
construct stochastic variables (i.e., functions) over it that will
represent the events in a physical situation.

For the remainder of this section, let the event space be the unit
interval 0 < x < 1 with the probability of each subinterval defined as
its length. That is, let the coordinate variable x have a density
function identically 1 in the interval, zero outside.

Over this event space let us find stochastic variables which describe
the physical situations in the illustrative examples of Chap. 2. In the
discrete case the best way to do this is by inspection, keeping
Theorem I in mind.


If u(x) is the total thrown on two dice, we have:

[Fig. 14. Horizontal axis marked in thirty-sixths.]

For the balls drawn from the urn, we have:

[Fig. 15.]
To get the continuous cases, we note that according to Theorem II,
since f(x) = 1, we have

$$g(u) = |x'(u)|.$$

Now, if we fix it so the signs are right, we shall have

$$g(u) = x'(u).$$

That is, the density function for u is the derivative of the inverse
function to u. So, if we look in Chap. 2 and find out what density
functions we want, to get the function u we integrate and then take the
inverse.

For the problem on the bombardment of a hemispherical screen, we
want

$$g(u) = \cos u;$$

therefore

$$x(u) = \int_{0}^{u} \cos t\, dt = \sin u,$$

and

$$u(x) = \arcsin x.$$

The picture:

[Fig. 16.]

In the radium atom problem, we want

$$g(u) = k e^{-ku};$$

therefore

$$x(u) = \int_{0}^{u} k e^{-kt}\, dt = 1 - e^{-ku},$$

and

$$u(x) = -\frac{1}{k} \log (1 - x).$$

The picture:

[Fig. 17.]
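The recipe "integrate the desired density, then take the inverse" can be
tested on the unit interval directly. Since the probability of a
subinterval is its length, pr{u ≤ t} is just the total length of the set
of x for which u(x) ≤ t, and this should equal the integral of the
desired density up to t. The sketch below estimates that length on a
fine grid of midpoints; the function names, the grid size, and the
sample value of k are assumptions of this sketch.

```python
import math


def u_screen(x):
    """Inverse of x(u) = sin u: the hemispherical-screen variable."""
    return math.asin(x)


def u_radium(x, k):
    """Inverse of x(u) = 1 - exp(-k*u): the radium-atom variable."""
    return -math.log(1.0 - x) / k


def probability_u_at_most(u, t, cells=100_000):
    """Length of {x in (0, 1): u(x) <= t}, estimated on cell midpoints."""
    hits = sum(1 for i in range(cells) if u((i + 0.5) / cells) <= t)
    return hits / cells


if __name__ == "__main__":
    k = 2.0
    print(probability_u_at_most(u_screen, 0.5), math.sin(0.5))
    print(probability_u_at_most(lambda x: u_radium(x, k), 0.7),
          1.0 - math.exp(-k * 0.7))
```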


Earlier in this chapter we mentioned the variable [u − (1/k)]² in
connection with this problem. From the above representation for u,
we find this very quickly:

$$v(x) = \Bigl(u - \frac{1}{k}\Bigr)^{2} = \frac{1}{k^{2}} \bigl[\log (1 - x) + 1\bigr]^{2}.$$

This function is shown by the dotted line in Fig. 17.

On applying Theorem IV to the stochastic variable v(x), we get

$$x(v) = 1 - e^{\pm k\sqrt{v} - 1},$$

$$|x'(v)| = \frac{k}{2\sqrt{v}}\, e^{\pm k\sqrt{v} - 1}.$$

The minus sign in the exponent applies to the branch that goes all the
way out. The plus sign applies to the branch that goes only to 1/k².
Since f[x(v)] = 1, we get g(v) by merely taking the proper values of
|x'(v)|. From 0 to 1/k² we must add the two exponentials to get a
hyperbolic cosine. From 1/k² to ∞ we have only the negative
exponential. This gives us the same result we had before:

$$g(v) = \begin{cases} 0 & \text{for } v \le 0, \\[4pt] \dfrac{k}{e\sqrt{v}} \cosh\bigl(k\sqrt{v}\bigr) & \text{for } 0 < v \le \dfrac{1}{k^{2}}, \\[8pt] \dfrac{k}{2\sqrt{v}}\, e^{-k\sqrt{v} - 1} & \text{for } v > \dfrac{1}{k^{2}}. \end{cases}$$


37. Sums and Products 


The general subject of functions of stochastic variables is one that
could well occupy an entire book. The individual values observed in
[. . .] variables, and the statistician [. . .] functions of these
variables. We shall not attempt to give a representative discussion of
sampling distributions [. . .] number of problems in mathematical
[. . .] concerned with specific examples of the operations [. . .]
Theorems I to IV, [. . .] the two [. . .] to be significant. In the
stochastic variable representation of Bernoullian trials we used the sum
of n variables. We have not looked at any products yet, but the
principle is the same; only the details are different. In the problem of
two dice, the event space is a square array of 36 points. Now, we let
u = xy, and (Theorem I) pick out the sets $E_a$ for which u = a:

[Fig. 18.]

So we have the probability function

$$\begin{array}{c|ccccccccc} u & 1 & 2 & 3 & 4 & 5 & 6 & 8 & 9 & 10 \\ \hline g(u) & 1/36 & 2/36 & 2/36 & 3/36 & 2/36 & 4/36 & 2/36 & 1/36 & 2/36 \end{array}$$

$$\begin{array}{c|ccccccccc} u & 12 & 15 & 16 & 18 & 20 & 24 & 25 & 30 & 36 \\ \hline g(u) & 4/36 & 2/36 & 1/36 & 2/36 & 2/36 & 2/36 & 1/36 & 2/36 & 1/36 \end{array}$$

By means of Theorem II or III we could [. . .] the density function for
the product of two continuous variables. Fortunately, most of the work
in the theory of probability can [. . .] density functions. It will
appear [. . .] of a stochastic variable give us [. . .] the moments can
frequently [. . .] the density functions. The point we want to make
[. . .] (if we could integrate everything) [. . .] probability functions
whether it is convenient to find them or not. So, when we speak (as in
Chap. 8) of

$$\operatorname{pr}\Bigl\{a < \sum_{i=1}^{n} x_i \le b\Bigr\}$$

we are just talking about the probability of a pair of inequalities on a
stochastic variable, no matter what means we use to evaluate such
probabilities.


There is one mistaken picture of sums and products that the student should be warned against. We have seen earlier in this chapter that, if we take the unit interval with a constant density function as the event space, we can describe a large variety of physical situations by means of functions over this space. Now, why not take two functions over the unit interval, multiply (or add) them together point by point, and call the result the product (or sum) of the two? In the case of the two dice, this would give the product the values 1, 4, 9, 16, 25, 36, with probabilities 1/6 each, something completely different from what we got above. The fallacy in this interpretation of sum and product is that it overlooks the fundamental idea of Chap. 3 that two or more stochastic variables are used to describe compound events and that a joint density function is called for to give probabilities for all these compound events. When we put the two stochastic variables (functions) over the same one-dimensional space, we do not represent all the compound events at all. The representation of x + y and xy as the sum and product, respectively, of two functions over an event space is not this

[Fig. 19]

but this

[Fig. 20. Panels: u(x,y) = x; v(x,y) = y; w(x,y) = x + y; z(x,y) = xy]


If, in the above picture of the variable xy, we impose the restriction y = x, we get the parabola indicated by the dotted line. Thus, if the variables are necessarily equal, the product xy reduces to $x^2$; and the three-dimensional picture may be replaced by a two-dimensional one.

The student should note the distinction between identical stochastic variables and equal stochastic variables. Two stochastic variables are identical if their marginal distributions are the same, but this involves no particular dependence of one on the other. Indeed, in many applications we have use for identical variables which are completely independent. Equal variables, on the other hand, have all points of their joint distribution on the line y = x.

As a case in point, let us consider variables x and y, each of which represents the result of throwing a single die. Here each of the variables $x^2$ and $y^2$ is the product of equal variables and so assumes the values 1, 4, 9, 16, 25, 36, with probability 1/6 each. The variable xy, on the other hand, is the product of identical but independent variables and has the probability function given at the start of this section.
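The difference between equal and identical variables can be made concrete with a short computation. This is an illustrative sketch only; the names `square` and `indep` are assumptions introduced here. For equal variables all the probability lies on the line y = x, while for identical but independent variables every point of the square array carries probability 1/36.

```python
from collections import Counter
from fractions import Fraction

faces = range(1, 7)

# Equal variables: y is the same throw as x, so only the diagonal y = x has mass.
square = Counter()
for x in faces:
    square[x * x] += Fraction(1, 6)

# Identical but independent variables: every pair (x, y) has mass 1/36.
indep = Counter()
for x in faces:
    for y in faces:
        indep[x * y] += Fraction(1, 36)

print(sorted(square.items()))
print(sorted(indep.items()))
```

Both variables have the same marginal distributions, yet the two probability functions for the product are quite different.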
38. Convolutions 


Let x and y be independent stochastic variables with density functions f(x) and g(y), respectively. Then

$$\varphi(x,y) = f(x)\,g(y)$$

is their joint density function. Let F(x) be the distribution function for x; that is,

$$\mathrm{pr}\{x < t\} = F(t).$$

Then, by (d) of Theorem IV, Chap. 2,

$$F'(x) = f(x).$$

Now, we want to find the density function for the variable x + y. First, we note that the set of points for which x + y < u is the lower left half plane (shaded in the diagram).

[Fig. 21]

Therefore, the distribution function

$$H(u) = \mathrm{pr}\{x + y < u\}$$

is the integral of $\varphi(x,y)$ over this shaded area:

$$H(u) = \int_{-\infty}^{\infty} \int_{-\infty}^{u-y} f(x)\,g(y)\,dx\,dy = \int_{-\infty}^{\infty} F(u - y)\,g(y)\,dy.$$
quired density fun 
erentiate the above equ 


The required density function h(u) is the derivative of H; so we differentiate the above equation with respect to u. This calls for differentiation under the integral sign, a step that requires justification. However, we shall skip that part of the proof and proceed to differentiate. This gives us

$$h(u) = \int_{-\infty}^{\infty} f(u - y)\,g(y)\,dy.$$

Clearly, we could have integrated in the other order and obtained

$$h(u) = \int_{-\infty}^{\infty} g(u - x)\,f(x)\,dx.$$

The two (necessarily equal) integrals obtained here define what is called the convolution of f and g. Frequently this result is written symbolically

$$h = f * g.$$

These results may be summed up as follows:

Theorem IX. If x and y are independent stochastic variables with density functions f and g, respectively, then the stochastic variable x + y has the density function

$$h(u) = f(u) * g(u) = \int_{-\infty}^{\infty} f(u - y)\,g(y)\,dy = \int_{-\infty}^{\infty} g(u - x)\,f(x)\,dx.$$


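For example, suppose x and y are chosen independently and at random between 0 and 1. Then,

$$f(x) = \begin{cases} 1 & \text{for } 0 \le x \le 1, \\ 0 & \text{otherwise,} \end{cases}
\qquad
g(y) = \begin{cases} 1 & \text{for } 0 \le y \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

Therefore,

$$f(u - y) = \begin{cases} 1 & \text{for } u - 1 \le y \le u, \\ 0 & \text{otherwise.} \end{cases}$$

For $0 \le u \le 1$,

$$f(u - y)\,g(y) = \begin{cases} 1 & \text{for } 0 \le y \le u, \\ 0 & \text{otherwise;} \end{cases}$$

for $1 < u \le 2$,

$$f(u - y)\,g(y) = \begin{cases} 1 & \text{for } u - 1 \le y \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

For all other values of u, $f(u - y)\,g(y)$ is identically zero. Therefore,

$$f(u) * g(u) = \begin{cases}
\displaystyle\int_0^u dy = u & \text{for } 0 \le u \le 1, \\
\displaystyle\int_{u-1}^1 dy = 2 - u & \text{for } 1 < u \le 2, \\
0 & \text{otherwise.}
\end{cases}$$

The picture:

[Fig. 22. Graph of h(u): a triangle rising from 0 at u = 0 to 1 at u = 1 and falling to 0 at u = 2]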
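The triangular density can be confirmed numerically by evaluating the convolution integral of Theorem IX directly. The sketch below is an illustration under stated assumptions: the helper names `convolve` and `h_exact`, the integration range, and the midpoint rule are choices made here, not anything prescribed by the text.

```python
def f(x):
    """Density of a point chosen at random between 0 and 1."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def convolve(f, g, u, lo=-1.0, hi=2.0, steps=30000):
    """Midpoint-rule approximation of the integral of f(u - y) g(y) dy over [lo, hi]."""
    dy = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * dy
        total += f(u - y) * g(y)
    return total * dy


def h_exact(u):
    """The triangular density obtained in the text."""
    if 0.0 <= u <= 1.0:
        return u
    if 1.0 < u <= 2.0:
        return 2.0 - u
    return 0.0


for u in (0.25, 0.5, 1.0, 1.5, 2.5):
    print(u, convolve(f, f, u), h_exact(u))
```

The range [-1, 2] is wide enough here because g vanishes outside [0, 1]; for other densities the limits would have to be chosen to cover the region where the integrand is not negligible.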


For certain discrete cases we get something that looks like a convolution directly from Theorem I. According to that theorem, h(u) is the sum of all values of f(x)g(y) for which x + y = u. Clearly, we can organize such a summation by setting x = u − y and summing over y; that is,

$$h(u) = \sum_y f(u - y)\,g(y),$$

the summation taken over all pertinent values of y.

For example, suppose

$$f(x) = {}_nC_x\, p^x q^{n-x} \quad (x = 0, 1, 2, \ldots, n),$$
$$g(y) = {}_nC_y\, p^y q^{n-y} \quad (y = 0, 1, 2, \ldots, n).$$

Of course, this means that x and y each give the number of successes in n Bernoullian trials; so x + y gives the number of successes in 2n trials, and we should have

$$h(u) = {}_{2n}C_u\, p^u q^{2n-u} \quad (u = 0, 1, 2, \ldots, 2n).$$

Let us check this by the methods of this chapter. Here

$$
\begin{aligned}
h(u) &= \sum_y f(u - y)\,g(y) \\
&= \sum_y {}_nC_{u-y}\, p^{u-y} q^{n-u+y}\; {}_nC_y\, p^y q^{n-y} \\
&= \sum_y p^u q^{2n-u}\; {}_nC_{u-y}\; {}_nC_y \\
&= p^u q^{2n-u} \sum_y {}_nC_{u-y}\; {}_nC_y.
\end{aligned}
$$

For $u \le n$, the summation is from 0 to u; and Prob. 23, Chap. 1, gives the required result. For u > n, the summation is from u − n to n; and the substitutions z = n − y and v = 2n − u reduce this to the previous case.

39. Normal Distributions

The so-called normal, or Gaussian, distribution is the one determined by a density function of the form

$$g(z) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(z-a)^2/2\sigma^2}.$$

If we make the translation x = z − a, we find from Theorem VI that the density function for x is

$$f(x) = g(x + a) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}.$$

Though the first form given above is usually regarded as the general definition of the normal distribution, we shall always assume that the translation has been made. Thus, when we speak of a normal distribution, we shall mean the simpler form in which a = 0. The significance of the constant $\sigma$ we shall see in Chap. 6. The importance of this entire density function in probability theory will appear in Chap. 8. For the present, we want to study an important special property of normally distributed stochastic variables that illustrates some of the ideas in this chapter.


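Before doing that, however, we should show that what we have called the density function for the normal distribution really is one, i.e., that

$$\int_{-\infty}^{\infty} f(x)\,dx = 1.$$

The indefinite integral

$$\Phi(t) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2\sigma^2}\,dx$$

is a new function, not expressible in terms of elementary operations. Like the trigonometric functions and logarithms, its values have been tabulated extensively, but this does not do us much good at the moment. The following trick is ordinarily used to evaluate the integral over the entire x axis:

$$
\begin{aligned}
\left[\int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}\,dx\right]^2
&= \frac{1}{2\pi\sigma^2} \int_{-\infty}^{\infty} e^{-x^2/2\sigma^2}\,dx \int_{-\infty}^{\infty} e^{-y^2/2\sigma^2}\,dy \\
&= \frac{1}{2\pi\sigma^2} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2\sigma^2}\,dx\,dy.
\end{aligned}
$$

Now, we change to polar coordinates and remember that the element of area dx dy becomes $r\,dr\,d\theta$.

$$
\begin{aligned}
\left[\int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}\,dx\right]^2
&= \frac{1}{2\pi} \int_0^{2\pi}\int_0^{\infty} e^{-r^2/2\sigma^2}\, \frac{r}{\sigma^2}\,dr\,d\theta \\
&= \frac{1}{2\pi} \int_0^{2\pi} \Big[-e^{-r^2/2\sigma^2}\Big]_0^{\infty}\,d\theta \\
&= \frac{1}{2\pi} \int_0^{2\pi} d\theta = 1.
\end{aligned}
$$

Since the integrand is positive, the integral itself (and not merely its square) is equal to 1.

Theorem X. Any linear combination of two independent normally distributed stochastic variables is itself a normally distributed stochastic variable.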


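Let x and y be independent stochastic variables with density functions

$$\frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-x^2/2\sigma_1^2} \quad\text{and}\quad \frac{1}{\sigma_2\sqrt{2\pi}}\, e^{-y^2/2\sigma_2^2},$$

respectively, and let u = ax + by. Now, let z = ax and w = by; then by Theorem VIII the density functions for z and w are

$$\frac{1}{\alpha\sqrt{2\pi}}\, e^{-z^2/2\alpha^2} \quad\text{and}\quad \frac{1}{\beta\sqrt{2\pi}}\, e^{-w^2/2\beta^2},$$

respectively, where $\alpha = a\sigma_1$ and $\beta = b\sigma_2$. By Theorem V, z and w are independent, and u = z + w; so Theorem IX applies, and

$$
\begin{aligned}
h(u) &= \int_{-\infty}^{\infty} \frac{1}{2\pi\alpha\beta}\, e^{-(u-w)^2/2\alpha^2}\, e^{-w^2/2\beta^2}\,dw \\
&= \int_{-\infty}^{\infty} \frac{1}{2\pi\alpha\beta} \exp\left(-\frac{\beta^2u^2 - 2\beta^2uw + \beta^2w^2 + \alpha^2w^2}{2\alpha^2\beta^2}\right) dw.
\end{aligned}
$$

On completing the square in the exponent, we have

$$
\begin{aligned}
h(u) &= \int_{-\infty}^{\infty} \frac{1}{2\pi\alpha\beta} \exp\left\{-\frac{1}{2\alpha^2\beta^2}\left[(\alpha^2+\beta^2)\left(w - \frac{\beta^2u}{\alpha^2+\beta^2}\right)^2 + \frac{\alpha^2\beta^2u^2}{\alpha^2+\beta^2}\right]\right\} dw \\
&= \frac{1}{\sqrt{2\pi(\alpha^2+\beta^2)}} \exp\left[-\frac{u^2}{2(\alpha^2+\beta^2)}\right] \int_{-\infty}^{\infty} \frac{\sqrt{\alpha^2+\beta^2}}{\alpha\beta\sqrt{2\pi}} \exp\left[-\frac{\alpha^2+\beta^2}{2\alpha^2\beta^2}\left(w - \frac{\beta^2u}{\alpha^2+\beta^2}\right)^2\right] dw;
\end{aligned}
$$

and this last integral is equal to 1; therefore the density function for u is

$$h(u) = \frac{1}{\sqrt{2\pi(\alpha^2+\beta^2)}}\, e^{-u^2/2(\alpha^2+\beta^2)}.$$

Thus, u is a normally distributed stochastic variable with

$$\sigma = \sqrt{\alpha^2 + \beta^2} = \sqrt{a^2\sigma_1^2 + b^2\sigma_2^2}.$$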
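Theorem X can be spot-checked numerically by evaluating the convolution integral for particular constants and comparing it with the normal density whose parameter is the square root of the sum of the squares. The sketch below is illustrative; the function names, the midpoint rule, the cutoff of 12 standard deviations, and the sample constants are all assumptions made here. Absolute values are taken so that a negative coefficient still yields a positive scale, a detail the text does not discuss.

```python
from math import exp, pi, sqrt


def normal_density(x, sigma):
    """Normal density with a = 0 and standard deviation sigma."""
    return exp(-x * x / (2 * sigma * sigma)) / (sigma * sqrt(2 * pi))


def density_of_sum(u, alpha, beta, half_width=12.0, steps=20000):
    """Midpoint-rule value of the convolution integral over w (tails cut off)."""
    lo = -half_width * beta
    dw = 2 * half_width * beta / steps
    total = 0.0
    for i in range(steps):
        w = lo + (i + 0.5) * dw
        total += normal_density(u - w, alpha) * normal_density(w, beta)
    return total * dw


# Hypothetical constants chosen only for illustration.
a, b, sigma1, sigma2 = 2.0, -3.0, 0.5, 1.5
alpha, beta = abs(a) * sigma1, abs(b) * sigma2
sigma = sqrt(alpha**2 + beta**2)

for u in (0.0, 1.0, 3.7):
    print(u, density_of_sum(u, alpha, beta), normal_density(u, sigma))
```

The two columns printed should agree to many decimal places, since the integrand is smooth and decays rapidly.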


By applying this theorem over and over again, we obtain the following result:

Corollary 1. If $x_1, x_2, \ldots, x_n$ are independent stochastic variables with density functions

$$f_i(x_i) = \frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-x_i^2/2\sigma_i^2},$$

then

$$u = \sum_{i=1}^{n} a_i x_i$$

is a stochastic variable with a density function

$$h(u) = \frac{1}{\sqrt{2\pi(\alpha_1^2 + \alpha_2^2 + \cdots + \alpha_n^2)}} \exp\left[-\frac{u^2}{2(\alpha_1^2 + \alpha_2^2 + \cdots + \alpha_n^2)}\right],$$

where $\alpha_i = a_i\sigma_i$.

In the case of the so-called circular normal distribution in the plane, the joint density function is

$$\varphi(x,y) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2+y^2)/2\sigma^2} = \frac{1}{2\pi\sigma^2}\, e^{-r^2/2\sigma^2},$$

where $r = \sqrt{x^2 + y^2}$. Now, the fact that the joint density function is a function of r alone does not make it the density function for r. As a matter of fact, surprising as it may seem, in this case r is not normally distributed. Let us apply (d) of Theorem V, Chap. 2. The inequalities t < r < t + dt define a ring, and it is this ring whose probability we want. The probability of this ring is approximately

$$\frac{1}{2\pi\sigma^2}\, e^{-t^2/2\sigma^2} A,$$

where A is the area of the ring. Now, A is approximately $2\pi t\,dt$; so

$$\mathrm{pr}\{t < r < t + dt\} \approx \frac{t}{\sigma^2}\, e^{-t^2/2\sigma^2}\,dt,$$

and (d) of Theorem V, Chap. 2, tells us that r has the density function

$$f(r) = \frac{r}{\sigma^2}\, e^{-r^2/2\sigma^2}.$$


REFERENCES FOR FURTHER STUDY 


Halmos, “The Foundations of Probability,” Am. Math. Monthly, vol. 
51, pp. 493-510, 1944. 

Kolmogoroff, Foundations of the Theory of Probability. 

Uspensky, Introduction to Mathematical Probability, Chaps. XIII, XVI. 


PROBLEMS 


1. Let x have a density function identically 1 from 0 to 1. Describe the stochastic variable of Prob. 30, Chap. 2, as a function of x. Ans. $u(x) = \sin \pi(x - \tfrac12)$.

2. Let x be as in Prob. 1. Describe a variable with Cauchy's distribution (Prob. 31, Chap. 2) as a function of x. Ans. $u(x) = \tan \pi(x - \tfrac12)$.

3. Find the probability function for the sum of the variables in Prob. 30, Chap. 3. Ans. u: 2, 3, 4, 5, 6; h(u): [...].

4. Find the probability function for the product of these variables. Ans. u: 1, 2, 3, 4, 6, [...]; h(u): [...].

5. Find the probability function for the sum of the variables in Prob. 31, Chap. 3. Ans. u: [...]; h(u): [...].

6. Find the probability function for the product of these variables. Ans. u: 1, 2, 3, 4, 6, [...]; h(u): [...].

7. Find the probability function for the sum of the squares of these variables. Ans. u: 2, 5, [...], 18; h(u): [...].

8. Let $\varphi(x,y) = x + y$ in the unit square $0 \le x \le 1$, $0 \le y \le 1$ with $\varphi = 0$ elsewhere. Find the density function for x + y. Ans.

$$h(u) = \begin{cases} u^2 & \text{for } 0 < u \le 1, \\ 2u - u^2 & \text{for } 1 < u \le 2, \\ 0 & \text{otherwise.} \end{cases}$$

9. Using the joint density function in Prob. 8, find the density function for xy. Ans. h(u) = 2 − 2u for 0 < u < 1, h = 0 otherwise.

10. Let x and y be chosen independently and at random between 0 and 1. Find the density function for xy. Ans. h(u) = log(1/u) for 0 < u < 1, h = 0 otherwise.

11. Let x and y be as in Prob. 10. Let u = max(x,y). Find the density function for u. Ans. h(u) = 2u for 0 < u < 1, h = 0 otherwise.

12. For the same x and y, find the density function for v = min(x,y). Ans. k(v) = 2 − 2v for $0 < v \le 1$, k = 0 otherwise.

13. Find the joint density function for the variables u and v of Probs. 11 and 12. Note that x and y are independent, but u and v are not. Ans. $\varphi(u,v) = 2$ for $0 \le v \le u$, $0 < u \le 1$, $\varphi = 0$ otherwise.

14. Let x and y be independent and have density functions

$$f(x) = \begin{cases} \alpha e^{-\alpha x} & \text{for } x > 0, \\ 0 & \text{for } x \le 0, \end{cases}
\qquad
g(y) = \begin{cases} \beta e^{-\beta y} & \text{for } y > 0, \\ 0 & \text{for } y \le 0, \end{cases}$$

respectively. Find the density function for x + y. Ans. $h(u) = \dfrac{\alpha\beta}{\alpha - \beta}\,(e^{-\beta u} - e^{-\alpha u})$ for u > 0, h = 0 for u < 0.

15. If $\alpha = \beta$, the answer to Prob. 14 is meaningless. Find the density function for u in this case. Ans. $h(u) = \alpha^2 u e^{-\alpha u}$ for u > 0, h = 0 for u < 0.

16. Let x and y be independent, and let each of them have Cauchy's distribution (Prob. 31, Chap. 2). Find the density function for x + y. Ans. $h(u) = 2/\pi(4 + u^2)$.

17. Let x and y be independent, and let each of them have Poisson's distribution (Prob. 23, Chap. 2). Find the probability function for x + y. Hint: See Prob. 21, Chap. 1. Ans. $h(u) = (2a)^u e^{-2a}/u!$.

18. Let x be chosen at random between −1 and 1. Find the density function for $x^2$. Ans. $h(u) = 1/2\sqrt{u}$ for 0 < u < 1, h = 0 otherwise.

19. Let x have the density function $f(x) = e^{-x}$ for x > 0 with f = 0 for x < 0. Find the density function for $x^2$. Ans. $h(u) = (1/2\sqrt{u})\,e^{-\sqrt{u}}$ for u > 0, h = 0 otherwise.

20. Compute the first two derivatives of the density function for a normal distribution. Find its maxima, minima, and inflection points, and draw a graph. Ans. Max. at 0, no min., infl. pts. at $\pm\sigma$.

21. Let x be a normally distributed variable with $\sigma = 1$. Find the density function for $x^2$. Ans. $h(u) = (1/\sqrt{2\pi u})\,e^{-u/2}$ for u > 0, h = 0 otherwise.

22. Let x and y be independent, and let each of them be normally distributed with $\sigma = 1$. Find the density function for $x^2 + y^2$. Hint: The density function for $\sqrt{x^2 + y^2}$ is found in the text (Sec. 39). Ans. $h(u) = \tfrac12 e^{-u/2}$ for u > 0, h = 0 otherwise.

23. Show from Probs. 21 and 22 that

$$\int_0^u \frac{dx}{\sqrt{x(u - x)}} = \pi.$$

Hint: Compare the convolution for Prob. 22 with the answer obtained by other methods.

24. If x is normally distributed, find the density function for $e^x$. Ans. $h(u) = \dfrac{1}{\sigma u\sqrt{2\pi}}\, e^{-(\log u)^2/2\sigma^2}$ for u > 0, h = 0 otherwise.

25. The magnitude of a certain physical characteristic (such as height, weight, and velocity) is normally distributed over the members of a certain large population. An experimenter chooses members of this population at random and measures this characteristic for each member chosen. His measuring process is susceptible to error, and the probability distribution for the errors is also normal and independent of the magnitude he is measuring. Show that there is a normal probability distribution for the result of a given measurement. Hint: Apply Theorem X.

26. Show that, if there is a normal probability distribution for the results (note that this is not quite the same as to say that the aggregate of results form a normal frequency distribution) and if the law of error is normal and independent of the magnitude being measured, then there is a normal probability distribution for the true measurement of an item chosen at random from the population.

27. Show that if $x_1, x_2, \ldots, x_n$ are the values of n independent samples drawn from a normally distributed population, then the sample mean

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

is a normally distributed variable.


CHAPTER 6 
MOMENTS 


No study of stochastic variables would be complete without a discussion of moments. Not only are the moments of a stochastic variable necessary tools for the statistician, but some knowledge of them is necessary for an understanding of the limit theorems (see Chap. 8), which are central to the calculus of probabilities itself.

The student has probably first heard of the term "moment" in connection with the law of levers in elementary physics. Most first-year calculus courses attempt some further discussion of the idea, but still the terminology of physics (centroid, moment of inertia, etc.) is ordinarily used. The student may therefore get the idea that a moment is something necessarily connected with rotating rigid bodies.

To the mathematician moment merely means a Stieltjes integral of a certain type. So in our discussion it means either integral or sum as the case may be (see Sec. 15). The moments of the physicist are examples of this type of integral. So are those of the statistician, and it is these latter that we shall be interested in here.

40. Moments of a Stochastic Variable

Since we are not using Stieltjes integrals, we need separate definitions of moments for the discrete and continuous cases.

Definition 1. The kth moment of the stochastic variable x about the point a is defined as follows:

(a) Discrete Case

$$\sum_x (x - a)^k f(x),$$

(b) Continuous Case

$$\int_{-\infty}^{\infty} (x - a)^k f(x)\,dx.$$

If (a) involves an infinite series or if (b) involves an improper integral, this series or integral must be absolutely convergent. Otherwise, we say the moment is not defined.

This condition is not, of course, automatically [...] with the formal [...]; so we shall make the blanket hypothesis that all moments mentioned in subsequent theorems are defined. As pointed out in Chap. 3, the order of integration may be changed in an absolutely convergent multiple integral; so we shall continue to do this as a matter of course.

To certain moments of stochastic variables the statistician gives special names and attaches special symbols. The first moment about zero is called the mean, or expectation, and is denoted by $\bar{x}$ or E(x), whichever symbol seems more convenient. In symbols:

$$E(x) = \bar{x} = \int_{-\infty}^{\infty} x f(x)\,dx.$$

The second moment about the mean is called the variance, or dispersion. (We shall use the former term exclusively.) It is denoted by var(x) or, frequently, by $\sigma^2$. In symbols:

$$\mathrm{var}(x) = \sigma^2 = \int_{-\infty}^{\infty} (x - \bar{x})^2 f(x)\,dx.$$

The positive square root of the variance is called the standard deviation. As one might suspect, it is denoted by $\sigma$.

41. Example—Normal Distributions 
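First, we might take a look at the moments of a normally distributed variable. Accordingly, let

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}.$$

Now,

$$\bar{x} = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x e^{-x^2/2\sigma^2}\,dx = -\frac{\sigma}{\sqrt{2\pi}} \Big[e^{-x^2/2\sigma^2}\Big]_{-\infty}^{\infty} = 0.$$

So the interesting moments are those about zero. Let us denote by $M_n$ the nth moment about zero.

To find $M_n$ in general, we take

$$x^n e^{-x^2/2\sigma^2}\,dx$$

and factor it as

$$(x^{n-1})(x e^{-x^2/2\sigma^2}\,dx).$$

Then, integrating by parts, we have the recursion formula:

$$
\begin{aligned}
M_n &= \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x^n e^{-x^2/2\sigma^2}\,dx \\
&= -\frac{\sigma}{\sqrt{2\pi}} \Big[x^{n-1} e^{-x^2/2\sigma^2}\Big]_{-\infty}^{\infty} + \frac{\sigma(n-1)}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^{n-2} e^{-x^2/2\sigma^2}\,dx \\
&= \sigma^2 (n - 1) M_{n-2}.
\end{aligned}
$$

We know from Sec. 39 that $M_0 = 1$, and we have just seen that $M_1 = 0$. Hence it follows that all the odd moments are zero, while the even moments are $\sigma^2, 3\sigma^4, 15\sigma^6, \ldots$ with the moment of order 2n given by

$$1 \times 3 \times 5 \times 7 \cdots (2n - 1)\,\sigma^{2n}.$$

In particular, we see that var(x) = $\sigma^2$; so the constant $\sigma$, which appeared merely as a parameter in our discussion of normal distributions in Sec. 39, is actually the standard deviation of the variable x.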
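The recursion for the moments of a normal distribution is easy to turn into a computation and to compare with direct numerical integration. This is a sketch under stated assumptions: the function names, the midpoint rule, and the cutoff at 12 standard deviations are choices made here for illustration.

```python
from math import exp, pi, sqrt


def normal_moment(n, sigma):
    """n-th moment about zero from M_n = sigma^2 (n - 1) M_(n-2), with M_0 = 1, M_1 = 0."""
    if n == 0:
        return 1.0
    if n == 1:
        return 0.0
    return sigma**2 * (n - 1) * normal_moment(n - 2, sigma)


def normal_moment_numeric(n, sigma, half_width=12.0, steps=40000):
    """Midpoint-rule value of the integral of x^n f(x) dx, tails cut off."""
    lo = -half_width * sigma
    dx = 2 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += x**n * exp(-x * x / (2 * sigma**2))
    return total * dx / (sigma * sqrt(2 * pi))


sigma = 1.5
for n in range(7):
    print(n, normal_moment(n, sigma), normal_moment_numeric(n, sigma))
```

For n = 2, 4, 6 the recursion returns the values written in the text as the squared, fourth-power, and sixth-power terms with coefficients 1, 3, and 15; the odd moments come out as zero.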


and we have just seen that Mı=0. 
ro, while the even 


order 2n given by 


42. Example—Cauchy's Distribution

It would be fallacious to argue in the above example that the odd moments are zero because the density function is symmetric with respect to 0. It is true that for an odd moment this gives a first quadrant "area" and a congruent third quadrant "area." Therefore, if the moment exists, it must be zero; but a doubly improper integral is not absolutely convergent unless it is finite at each end. If both the areas mentioned above are infinite, we say the moment does not exist.

As an example of this sort of thing, consider the stochastic variable x with density function

$$f(x) = \frac{1}{\pi(1 + x^2)}.$$

This is usually referred to as Cauchy's distribution (see Prob. 31, Chap. 2). Note that

$$\int_{-\infty}^{\infty} \frac{dx}{\pi(1 + x^2)} = \frac{1}{\pi} \Big[\arctan x\Big]_{-\infty}^{\infty} = \frac{1}{\pi}\left(\frac{\pi}{2} + \frac{\pi}{2}\right) = 1,$$

as required. However,

$$\int_{-\infty}^{\infty} \frac{x\,dx}{\pi(1 + x^2)} = \frac{1}{2\pi} \Big[\log(1 + x^2)\Big]_{-\infty}^{\infty},$$

and this integral tends to infinity on each end. So we say the first moment of this stochastic variable is not defined.
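The contrast between the two integrals can be seen by evaluating their closed forms at larger and larger cutoffs. The sketch below is an illustration; the function names and the sample cutoffs are assumptions made here. The total probability settles down to 1, while the "area" on one side of the first-moment integral keeps growing like a logarithm.

```python
from math import atan, log, pi


def cauchy_mass(t):
    """Integral of 1/(pi (1 + x^2)) from -t to t."""
    return 2 * atan(t) / pi


def cauchy_half_moment(t):
    """Integral of x/(pi (1 + x^2)) from 0 to t, that is, log(1 + t^2)/(2 pi)."""
    return log(1 + t * t) / (2 * pi)


for t in (10.0, 1e3, 1e6, 1e12):
    print(t, cauchy_mass(t), cauchy_half_moment(t))
```

Because the one-sided integral has no finite limit, the cancellation suggested by symmetry is not available, which is exactly the point of the example.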




43. Example—The Radium Atom 


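We have seen (Sec. 14) that, if x is the time of disintegration of a given atom of radium, then x is a stochastic variable with

$$f(x) = \begin{cases} k e^{-kx} & \text{for } x > 0, \\ 0 & \text{for } x < 0. \end{cases}$$

So in this case we have

$$
\begin{aligned}
E(x) &= \int_0^{\infty} k x e^{-kx}\,dx \\
&= \Big[-x e^{-kx}\Big]_0^{\infty} + \int_0^{\infty} e^{-kx}\,dx \\
&= \Big[-\frac{1}{k}\, e^{-kx}\Big]_0^{\infty} \\
&= \frac{1}{k}.
\end{aligned}
$$

Our interest (in Chap. 5) in the variable $[x - (1/k)]^2$ stems from the fact that

$$
\begin{aligned}
\mathrm{var}(x) &= \int_0^{\infty} k\left(x - \frac{1}{k}\right)^2 e^{-kx}\,dx \\
&= \int_0^{\infty} k x^2 e^{-kx}\,dx - \frac{2}{k} \int_0^{\infty} k x e^{-kx}\,dx + \frac{1}{k^2} \int_0^{\infty} k e^{-kx}\,dx \\
&= \frac{2}{k^2} - \frac{2}{k^2} + \frac{1}{k^2} \\
&= \frac{1}{k^2}.
\end{aligned}
$$

It is worth noting here that, while $\bar{x} = 1/k$,

$$\mathrm{pr}\left\{x < \frac{1}{k}\right\} = \int_0^{1/k} k e^{-kx}\,dx = \Big[-e^{-kx}\Big]_0^{1/k} = 1 - e^{-1} = .63+.$$

This should [...]. In the continuous case (such as we have here) the [...]. In the case of the radium atom [...]

$$\int_0^a k e^{-kx}\,dx = 1 - e^{-ka} = \frac12;$$
$$e^{-ka} = \frac12;$$
$$a = \frac{\log 2}{k} = \bar{x} \log 2.$$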
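The figures for the radium atom can be reproduced numerically. The following is a rough sketch under stated assumptions: the function name `exponential_summary`, the value k = 2, the midpoint rule, and the cutoff of the upper limit are all choices made here for illustration.

```python
from math import exp, log


def exponential_summary(k, upper=60.0, steps=100000):
    """Midpoint-rule mean and variance of f(x) = k e^(-kx), x > 0 (tail cut at upper/k)."""
    dx = upper / k / steps
    xs = [(i + 0.5) * dx for i in range(steps)]
    mean = sum(x * k * exp(-k * x) for x in xs) * dx
    var = sum((x - 1 / k) ** 2 * k * exp(-k * x) for x in xs) * dx
    return mean, var


k = 2.0  # hypothetical disintegration constant
mean, var = exponential_summary(k)
below_mean = 1 - exp(-k * (1 / k))   # pr{x < 1/k} = 1 - 1/e
a = log(2) / k                       # solves 1 - e^(-ka) = 1/2

print(mean, var, below_mean, a)
```

The value a is smaller than the mean by the factor log 2, in agreement with the last display above.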


44. Expectations 


We saw in Chap. 5 that, if x is a stochastic variable, then in general u = u(x) is also a stochastic variable. So u has an expectation given by Definition 1:

$$E(u) = \int_{-\infty}^{\infty} u\,g(u)\,du,$$

where g(u) is the density function for u. We spent quite some time in Chap. 5 finding these density functions from the original density function f(x). However, this sometimes turns out to be quite a chore, and one of the nicest things about expectations is that we do not have to find g(u) in order to find E(u). This observation holds for functions of several stochastic variables as well as for functions of one. A general statement might be given as follows:

Theorem I. If the stochastic variables $x_1, x_2, \ldots, x_n$ have a joint density function (joint probability function) $\varphi(x_1, x_2, \ldots, x_n)$ and $u = u(x_1, x_2, \ldots, x_n)$ is any stochastic variable defined over the event space of the x's, then

$$E(u) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_1, x_2, \ldots, x_n)\,\varphi(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n$$

$$\left[E(u) = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} u(x_1, x_2, \ldots, x_n)\,\varphi(x_1, x_2, \ldots, x_n)\right].$$


Important as this theorem is, we shall not try to give a general proof. Fundamentally it involves the relationship [...]. In Theorems I to IV, Chap. 5, we got as much information on this relationship as we can in an elementary course. Our treatment of Theorem I will be to show [...] on the basis of these four statements [...]. Bear in mind that, while our proof is incomplete, Theorem I is true in general.

First, let us take the discrete case. By definition

$$E(u) = \sum_u u\,g(u).$$

By Theorem I, Chap. 5,

$$g(a) = \mathrm{pr}\{u = a\} = \sum_{u=a} \varphi(x_1, x_2, \ldots, x_n).$$

So

$$E(u) = \sum_u u \sum_{u = \text{const}} \varphi(x_1, x_2, \ldots, x_n).$$

This is the sum of all terms $u\varphi$, grouped according to the values of u. Now, the sum called for in Theorem I,

$$\sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} u(x_1, x_2, \ldots, x_n)\,\varphi(x_1, x_2, \ldots, x_n),$$

is the sum of exactly the same terms; only this time they are grouped according to the values of the x's. Therefore, one sum is merely a rearrangement of the other. In the case of a finite sum, these are obviously equal. In the case of an infinite series, the requirement of absolute convergence guarantees that the rearrangement of the terms does not change the sum.

In the simplest continuous case (that covered by Theorem II, Chap. 5), the proof of Theorem I amounts to a simple change of variable. First, let us recall that in this case both u(x) and its inverse are single-valued. Furthermore, $x'(u)$ has the same sign for all u's; that is, $|x'(u)| = \pm x'(u)$, the same sign applying everywhere. Starting, now, from the definition of expectation and applying Theorem II, Chap. 5, we have

$$
\begin{aligned}
E(u) &= \int_{-\infty}^{\infty} u\,g(u)\,du \\
&= \int_{-\infty}^{\infty} u\,f[x(u)]\,|x'(u)|\,du \\
&= \int u\,f[x(u)]\,x'(u)\,du,
\end{aligned}
$$

the limits of integration being reversed if $x'$ is negative and remaining the same if $x'$ is positive. Now, the substitution u = u(x) either reverses the limits of integration again or leaves them alone, according to the same criteria; therefore

$$\int_{-\infty}^{\infty} u\,f[x(u)]\,|x'(u)|\,du = \int_{-\infty}^{\infty} u(x)\,f(x)\,dx.$$

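Let us now abandon the idea of proving Theorem I and turn to some examples that illustrate that it really does work in cases covered by Theorems III and IV, Chap. 5. We saw in Sec. 39 that, if

$$\varphi(x,y) = \frac{1}{2\pi}\, e^{-(x^2+y^2)/2}$$

and $r = \sqrt{x^2 + y^2}$, then

$$f(r) = r e^{-r^2/2}.$$

Therefore, by definition

$$E(r) = \int_0^{\infty} r^2 e^{-r^2/2}\,dr.$$

By Theorem I

$$E(r) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{1}{2\pi} \sqrt{x^2 + y^2}\; e^{-(x^2+y^2)/2}\,dx\,dy.$$

To reduce the latter to the former, all we have to do is change to polar coordinates:

$$
\begin{aligned}
E(r) &= \int_0^{2\pi} \int_0^{\infty} \frac{1}{2\pi}\, r e^{-r^2/2}\, r\,dr\,d\theta \\
&= \int_0^{\infty} r^2 e^{-r^2/2}\,dr.
\end{aligned}
$$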


To illustrate the cases covered by Theorem IV, Chap. 5, let us take a very simple example. Let

$$f(x) = \begin{cases} \tfrac12 & \text{for } -1 \le x \le 1, \\ 0 & \text{otherwise,} \end{cases}$$

and let $u = x^2$. By Theorem IV, Chap. 5,

$$g(u) = 2 \cdot \frac12 \cdot \frac{1}{2\sqrt{u}} = \frac{1}{2\sqrt{u}} \quad (0 < u \le 1).$$

(Note that we doubled it to take care of both branches.) So, by definition,

$$E(u) = \int_0^1 \frac{\sqrt{u}}{2}\,du = \Big[\tfrac13 u^{3/2}\Big]_0^1 = \frac13.$$

By Theorem I,

$$E(u) = \int_{-1}^{1} \frac{x^2}{2}\,dx = \Big[\tfrac16 x^3\Big]_{-1}^{1} = \frac13.$$

The formulas in Theorem I are often given as a definition of expectation. This is certainly the easy way out, because Theorem I gives the form from which expectations are ordinarily computed; and using it as a definition seemingly eliminates the necessity for all explanations. However, without Theorem I, such a definition is ambiguous. From the considerations in Chap. 5, we see that a given stochastic variable can be represented in many different ways as a function over first one event space and then another. From which representation is the expectation defined? According to Theorem I, they all lead to the same thing.
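The agreement of the two routes to E(u) in the example with a variable distributed uniformly between −1 and 1 can be checked numerically. This is only a sketch; the helper `midpoint` and the variable names are assumptions introduced here. One integral uses the density of u found from Chap. 5, the other uses Theorem I with the original density.

```python
from math import sqrt


def midpoint(func, lo, hi, steps=100000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * dx) for i in range(steps)) * dx


# By definition: integrate u * g(u) with g(u) = 1/(2 sqrt(u)) on 0 < u <= 1.
via_density_of_u = midpoint(lambda u: u * (1 / (2 * sqrt(u))), 0.0, 1.0)

# By Theorem I: integrate u(x) * f(x) = x^2 * (1/2) on -1 <= x <= 1.
via_theorem_one = midpoint(lambda x: x * x * 0.5, -1.0, 1.0)

print(via_density_of_u, via_theorem_one)
```

The midpoint rule never evaluates the first integrand at u = 0, so the singularity of g(u) there causes no difficulty.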


There are four immediate corollaries of Theorem I that are worth noting. If u is a stochastic variable and a is a constant,

Corollary 1

$$E(u + a) = E(u) + a.$$

Corollary 2

$$E(au) = aE(u).$$

Corollary 3

$$\mathrm{var}(u) = E[(u - \bar{u})^2].$$

Corollary 4

$$E(u - \bar{u}) = 0.$$

By setting up the expectations of the sum and product of two stochastic variables in the form given by Theorem I, we obtain very easily two theorems which are invaluable in any work with moments. As usual, we shall base our proof on the assumption that there is a joint density function. We should point out, however, that Theorems II and III hold for any joint distribution for which the required expectations are defined.


Theorem II. If x and y are any two stochastic variables,

$$E(x + y) = E(x) + E(y).$$

This proof is very simple.

$$
\begin{aligned}
E(x + y) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x + y)\,\varphi(x,y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} x \int_{-\infty}^{\infty} \varphi(x,y)\,dy\,dx + \int_{-\infty}^{\infty} y \int_{-\infty}^{\infty} \varphi(x,y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} x f(x)\,dx + \int_{-\infty}^{\infty} y g(y)\,dy \\
&= E(x) + E(y).
\end{aligned}
$$

Repeated application of Theorem II, along with Corollary 2 to Theorem I, gives

Corollary 1. If $x_1, x_2, \ldots, x_n$ are stochastic variables and $a_1, a_2, \ldots, a_n$ are constants,

$$E\left(\sum_{i=1}^{n} a_i x_i\right) = \sum_{i=1}^{n} a_i E(x_i).$$

Theorem III. If x and y are independent stochastic variables,

$$E(xy) = E(x)E(y).$$

The proof of this is even simpler. If x and y are independent, $\varphi(x,y) = f(x)g(y)$; so

$$
\begin{aligned}
E(xy) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y\,f(x)\,g(y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} x f(x)\,dx \int_{-\infty}^{\infty} y g(y)\,dy \\
&= E(x)E(y).
\end{aligned}
$$

A similar theorem holds for n variables, provided that they are totally independent. The student should note that this is a much stronger restriction than a requirement that every pair of them be independent.

It is very important to note that the addition theorem (Theorem II) holds for all pairs of stochastic variables, while in the multiplication theorem (Theorem III) we assumed independence of x and y. As we shall see later in this chapter, independence is not necessary for the multiplication theorem to hold; but it very definitely does not hold for all pairs of variables.

45. Example—Balls in an Urn

There are 3 balls in an urn, numbered 1, 2, 3. We draw 2 balls and note their numbers, then consider the expectation of the sum and product of the numbers noted. If we return the first ball before the second drawing, the variables are independent; if we hold it out, they are dependent. Let us consider both cases. The joint and marginal probability functions are easily found in each case. They are as follows:

Independent case:

    y                        g(y)
    3     1/9   1/9   1/9    1/3
    2     1/9   1/9   1/9    1/3
    1     1/9   1/9   1/9    1/3
    f(x)  1/3   1/3   1/3
    x      1     2     3

Dependent case:

    y                        g(y)
    3     1/6   1/6    0     1/3
    2     1/6    0    1/6    1/3
    1      0    1/6   1/6    1/3
    f(x)  1/3   1/3   1/3
    x      1     2     3

All the marginal probability functions are the same; so in each case

$$E(x) = E(y) = \tfrac13(1 + 2 + 3) = 2.$$

Let us find the expectations of x + y and xy from their probability functions. To get these, we draw the level lines for x + y and xy:

[Fig. 23. Level lines of u = x + y and of v = xy over the 3 by 3 array]

According to Theorem I, Chap. 5, the probability functions for u and v are obtained by adding the values of $\varphi(x,y)$ along these lines. We have:

Independent case:

    u:     2     3     4     5     6
    g(u):  1/9   2/9   3/9   2/9   1/9

    E(u) = (1/9)(2 + 6 + 12 + 10 + 6) = 36/9 = 4

    v:     1     2     3     4     6     9
    g(v):  1/9   2/9   2/9   1/9   2/9   1/9

    E(v) = (1/9)(1 + 4 + 6 + 4 + 12 + 9) = 36/9 = 4

Dependent case:

    u:     2     3     4     5     6
    g(u):  0     2/6   2/6   2/6   0

    E(u) = (1/6)(0 + 6 + 8 + 10 + 0) = 24/6 = 4

    v:     1     2     3     4     6     9
    g(v):  0     2/6   2/6   0     2/6   0

    E(v) = (1/6)(0 + 4 + 6 + 0 + 12 + 0) = 22/6


Perhaps the educational thing about this example is not that the 
multiplication theorem fails in the dependent case but that the addition 
theorem holds in both cases. The student should note that this is not 
mere coincidence. It is a direct consequence of Theorem II. 


46. Example—The Matching Problem 

To illustrate the real power of Theorem II, let us consider a more complicated example. In Sec. 33 we saw that, if n numbered balls are placed in n numbered pockets, the probability that r of the numbers on the balls match the numbers on the pockets is

$$\frac{1}{r!} \sum_{s=0}^{n-r} \frac{(-1)^s}{s!}.$$

By definition, the expectation of the number of matchings is

$$\sum_{r=0}^{n} r \cdot \frac{1}{r!} \sum_{s=0}^{n-r} \frac{(-1)^s}{s!}.$$

A direct computation of this would be rather messy. However, the total number of matchings is the sum of the number of matchings for the individual balls. Now, each ball matches either once or not at all, and (with no knowledge of what the other balls are doing) we see that the probability of matching for each ball is 1/n. Letting $x_i$ (i = 1, 2, ..., n) be a stochastic variable assuming the values 0 and 1 with probability 1 − (1/n) and 1/n, respectively, we have

$$E(x_i) = \frac{1}{n} \quad (i = 1, 2, \ldots, n)$$

and

$$r = \sum_{i=1}^{n} x_i,$$

whence

$$E(r) = \sum_{i=1}^{n} E(x_i) = n \cdot \frac{1}{n} = 1.$$

The variables $x_i$ are obviously dependent here, but that does not affect the validity of our application of Theorem II.

A very interesting observation suggested by the above analysis of the matching problem concerns the experiments to test for extrasensory perception. Cards numbered from 1 to n are shuffled and then turned up one at a time. The supposedly gifted subject is placed so that he cannot see the cards, and as they are turned, he attempts to call out the numbers. Suppose a card is discarded after being turned so that, in a given run of n trials, each number appears once and only once. Suppose, further, that the subject knows this and tries to use cleverness as well as extrasensory perception by calling each number once and only once. Then, except for E.S.P., we have exactly the matching problem; and the expectation of the number of correct calls is 1.

However, the fascinating thing about this problem is that we get the same expectation no matter how the cards are handled and no matter how the subject organizes his calls. The number of correct calls in n trials is the sum of the number of correct calls on the individual trials. Therefore, Theorem II applies; and so long as the probability of success is 1/n on each trial, the expectation is exactly 1 on n trials no matter what the nature and degree of the dependence of the trials may be. Note, further, that n · (1/n) = 1 for all n's; so it does not even matter how many cards are used!
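For small n the expectation in the matching problem can be verified exhaustively, both by counting matches over all arrangements and by the "messy" direct sum. The sketch below is an illustration; the function names are assumptions introduced here, and the second function simply encodes the probability formula quoted from Sec. 33.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def expected_matches(n):
    """Exact expectation of the number of matches over all n! equally likely arrangements."""
    total = sum(sum(1 for pocket, ball in enumerate(perm) if pocket == ball)
                for perm in permutations(range(n)))
    return Fraction(total, factorial(n))


def match_probability(n, r):
    """Probability of exactly r matches: (1/r!) * sum over s from 0 to n - r of (-1)^s / s!."""
    return Fraction(1, factorial(r)) * sum(Fraction((-1) ** s, factorial(s))
                                           for s in range(n - r + 1))


for n in range(1, 7):
    direct = sum(r * match_probability(n, r) for r in range(n + 1))
    print(n, expected_matches(n), direct)
```

Every line printed shows the same expectation by both routes, independent of n, as the argument from Theorem II predicts.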


47. Variance 


i i t, 

The first thing we might note about the quantity var(x) is oe 

according to Corollary 3 to Theorem I, it is the expectation of a i t 
chastic variable. Therefore, Theorem I applies to give us the res 


ot A iable 
that, if x has a density function f(x) and u(x) is a stochastic yariabl 
over the x axis, then 


var(u) = i (u(x) — a2f(x)de. 


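For example, in the radium atom problem

$$F'(u) = \begin{cases} k e^{-ku} & \text{for } u > 0, \\ 0 & \text{for } u < 0; \end{cases}$$

therefore

$$\operatorname{var}(u) = \int_0^{\infty} \left(u - \frac{1}{k}\right)^2 k e^{-ku}\,du = \frac{1}{k^2}$$

(see Sec. 43). However, in Chap. 5, we characterized the variable u as the function

$$u(x) = -\frac{1}{k}\log(1 - x)$$

over the unit interval with constant density function. We noted there that

$$[u(x) - \bar{u}]^2 = \frac{1}{k^2}[\log(1 - x) + 1]^2;$$

therefore

$$\operatorname{var}(u) = \frac{1}{k^2}\int_0^1 [\log(1 - x) + 1]^2\,dx.$$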


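Finally, we saw that the variable

$$v = (u - \bar{u})^2$$

has a density function

$$g(v) = \begin{cases} 0 & \text{for } v < 0, \\ \dfrac{k}{e\sqrt{v}}\cosh(k\sqrt{v}) & \text{for } 0 < v < \dfrac{1}{k^2}, \\ \dfrac{k}{2\sqrt{v}}\,e^{-k\sqrt{v}-1} & \text{for } v > \dfrac{1}{k^2}. \end{cases}$$

Therefore,

$$\operatorname{var}(u) = E(v) = \int_0^{1/k^2} \frac{k}{e}\sqrt{v}\cosh(k\sqrt{v})\,dv + \int_{1/k^2}^{\infty} \frac{k}{2}\sqrt{v}\,e^{-k\sqrt{v}-1}\,dv.$$

Of these expressions for var(u), the first is obviously the simplest. We merely note the other two by way of pointing out that the points of view suggested by them are perfectly legitimate.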
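As a rough numerical check that different points of view give the same variance, the sketch below evaluates the integral over the u axis and the integral over the unit interval by a midpoint rule and compares both with 1/k². The helper names, the truncation point of the infinite range, and the value k = 2 are assumptions made for illustration only.

```python
import math

def midpoint(f, a, b, steps=200_000):
    """Composite midpoint rule; it never evaluates f at the endpoints."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def var_from_density(k):
    # Integrate (u - 1/k)^2 * k e^{-ku}; the range is cut off where the tail is negligible.
    upper = 60.0 / k
    return midpoint(lambda u: (u - 1.0 / k) ** 2 * k * math.exp(-k * u), 0.0, upper)

def var_from_unit_interval(k):
    # (1/k^2) times the integral over (0, 1) of [log(1 - x) + 1]^2.
    return midpoint(lambda x: (math.log(1.0 - x) + 1.0) ** 2, 0.0, 1.0) / k ** 2

k = 2.0
print(var_from_density(k), var_from_unit_interval(k), 1.0 / k ** 2)
```

The second integrand is unbounded near x = 1, so the midpoint rule converges slowly there; the agreement is only to a few decimal places.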


The characterization of var(x) as

$$E[(x - \bar{x})^2],$$

together with the addition and multiplication theorems for expectations, leads us to a number of interesting results.

Theorem IV. If x is a stochastic variable and a is a constant,

(a) $\operatorname{var}(x) = E(x^2) - \bar{x}^2$.
(b) $\operatorname{var}(x + a) = \operatorname{var}(x)$.
(c) $\operatorname{var}(ax) = a^2 \operatorname{var}(x)$.

These results follow immediately from the addition theorem and the corollaries to Theorem I. To prove (a):

$$\operatorname{var}(x) = E[(x - \bar{x})^2] = E(x^2 - 2x\bar{x} + \bar{x}^2) = E(x^2) - 2\bar{x}E(x) + \bar{x}^2 = E(x^2) - 2\bar{x}^2 + \bar{x}^2 = E(x^2) - \bar{x}^2.$$


For (b), $E(x + a) = \bar{x} + a$; therefore

$$\operatorname{var}(x + a) = E[(x + a - \bar{x} - a)^2] = E[(x - \bar{x})^2] = \operatorname{var}(x).$$

For (c), $E(ax) = a\bar{x}$; therefore

$$\operatorname{var}(ax) = E[(ax - a\bar{x})^2] = E[a^2(x - \bar{x})^2] = a^2 E[(x - \bar{x})^2] = a^2 \operatorname{var}(x).$$

The student would do well to remember (b) and (c) in particular. They say that a translation does not affect the variance, while a change of scale multiplies the variance by the square of the change of scale factor.
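The three parts of Theorem IV can be verified exactly on any small discrete distribution. The sketch below uses rational arithmetic; the distribution `dist` and the constant `a` are hypothetical values chosen only for illustration.

```python
from fractions import Fraction

def expectation(dist, g=lambda x: x):
    """E[g(x)] for a discrete distribution given as {value: probability}."""
    return sum(p * g(x) for x, p in dist.items())

def variance(dist):
    mean = expectation(dist)
    return expectation(dist, lambda x: (x - mean) ** 2)

def transform(dist, g):
    """Distribution of g(x); probabilities of coinciding values are merged."""
    out = {}
    for x, p in dist.items():
        out[g(x)] = out.get(g(x), 0) + p
    return out

# Hypothetical example distribution (not from the text).
dist = {Fraction(-1): Fraction(1, 4), Fraction(0): Fraction(1, 4), Fraction(3): Fraction(1, 2)}
a = Fraction(5, 2)

print(variance(dist))                                 # var(x)
print(variance(transform(dist, lambda x: x + a)))     # translation leaves it unchanged
print(variance(transform(dist, lambda x: a * x)))     # change of scale multiplies it by a^2
```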


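Theorem V. If $x_1, x_2, \ldots, x_n$ are independent stochastic variables, then

$$\operatorname{var}\left(\sum_{i=1}^{n} x_i\right) = \sum_{i=1}^{n} \operatorname{var}(x_i).$$

Let $z_i = x_i - \bar{x}_i$. Then, by Corollary 4 of Theorem I, $\bar{z}_i = 0$; so, by Theorem II,

$$E(\Sigma z_i) = 0.$$

Furthermore, by (b) of Theorem IV,

$$\operatorname{var}(z_i) = \operatorname{var}(x_i),$$

and since $\Sigma z_i = \Sigma(x_i - \bar{x}_i) = \Sigma x_i - \Sigma \bar{x}_i$,

$$\operatorname{var}(\Sigma z_i) = \operatorname{var}(\Sigma x_i).$$

Therefore, it suffices to prove that $\operatorname{var}(\Sigma z_i) = \Sigma \operatorname{var}(z_i)$. Since $E(\Sigma z_i) = 0$,

$$\operatorname{var}(\Sigma z_i) = E[(\Sigma z_i)^2] = E\left(\sum_i z_i^2 + \sum_{i \ne j} z_i z_j\right) = \sum_i E(z_i^2) + \sum_{i \ne j} E(z_i z_j) = \sum_i \operatorname{var}(z_i) + \sum_{i \ne j} E(z_i z_j).$$

Now, since the x's are independent, the z's are too; so Theorem III applies, and

$$E(z_i z_j) = E(z_i)E(z_j) = 0$$

for $i \ne j$. This gives us the result we want.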
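Theorem V can be illustrated by building the distribution of a sum of independent variables explicitly and comparing its variance with the sum of the separate variances. This is a sketch under stated assumptions: two fair dice and a biased coin are hypothetical examples, and independence is imposed by multiplying probabilities.

```python
from fractions import Fraction
from itertools import product

def mean(dist):
    return sum(p * x for x, p in dist.items())

def variance(dist):
    m = mean(dist)
    return sum(p * (x - m) ** 2 for x, p in dist.items())

def sum_of_independent(dists):
    """Distribution of x_1 + ... + x_n when the x_i are independent."""
    out = {}
    for combo in product(*(d.items() for d in dists)):
        value = sum(x for x, _ in combo)
        prob = Fraction(1)
        for _, p in combo:
            prob *= p          # independence: joint probability is the product
        out[value] = out.get(value, 0) + prob
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}
coin = {0: Fraction(1, 3), 1: Fraction(2, 3)}   # hypothetical biased coin
dists = [die, die, coin]

print(variance(sum_of_independent(dists)), sum(variance(d) for d in dists))
```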

The variance of a stochastic variable is supposed to give some indica- 
tion of the “spread” of the values of the variable. An indication of 
the way in which it does this is given by the following: 


Theorem VI (Tshebysheff's Inequality). If x is any stochastic variable and σ is its standard deviation, then

$$\operatorname{pr}\{|x - \bar{x}| \ge t\sigma\} \le \frac{1}{t^2}.$$

To simplify matters, let $z = x - \bar{x}$. This does not alter the value of σ. Now, we have

$$\sigma^2 = \int_{-\infty}^{\infty} z^2 f(z)\,dz \ge \int_{|z| \ge t\sigma} z^2 f(z)\,dz \ge \int_{|z| \ge t\sigma} (t\sigma)^2 f(z)\,dz = t^2\sigma^2 \int_{|z| \ge t\sigma} f(z)\,dz = t^2\sigma^2 \operatorname{pr}\{|z| \ge t\sigma\};$$

and, dividing by $t^2\sigma^2$, we have $\operatorname{pr}\{|z| \ge t\sigma\} \le 1/t^2$.

For each t ≥ 1 there is a stochastic variable for which the equality in Tshebysheff's inequality holds. Consider, for instance, the variable

   x:     -t            0            t
f(x):  1/(2t^2)    1 - 1/t^2     1/(2t^2)

Here σ = 1, and pr{|x| ≥ t} = 1/t². In this sense, Theorem VI is the best general estimate of its kind. However, to get an example of the equality, we should have to change the stochastic variable as we change t. Therefore, for a given stochastic variable, Theorem VI is by no means a best estimate for every t. Obviously, if we know the density or probability function for the stochastic variable, we do not need any estimate like Theorem VI at all. To get $\operatorname{pr}\{|x - \bar{x}| \ge t\sigma\}$, we merely integrate (or add) over the indicated domain. In between these two extremes, there is the possibility that f(x) is not definitely known but that there are some known restrictions on it. Several "improvements" on Tshebysheff's inequality have been proved by assuming such restrictive hypotheses, but we shall not attempt to summarize them here.

It might be interesting to compare the results given by Theorem VI with the actual values of $\operatorname{pr}\{|x - \bar{x}| \ge t\sigma\}$ for the normal distribution.

                  pr{|x - x̄| ≥ tσ}

  t    Upper limit for        Actual value for
       all distributions      normal distribution

  1         1.00                   .3174
  2          .25                   .0456
  3          .11                   .0027
  4          .06                   .00006

This table shows that Theorem VI does not even come close to giving an estimate of the probability significance of σ for a normal distribution. But this argument cuts both ways. Our table also shows that the normal distribution does not even come close to estimating the possible vagaries of a stochastic variable.
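The two columns of such a comparison can be recomputed directly. The sketch below assumes the standard identity that, for a normal variable, the two-sided tail probability equals erfc(t/√2); the function names are ours. Values from a modern library may differ from older printed tables in the last digit.

```python
import math

def chebyshev_bound(t):
    """Upper limit 1/t^2 from Theorem VI (capped at 1, since it bounds a probability)."""
    return min(1.0, 1.0 / t ** 2)

def normal_two_sided_tail(t):
    """pr{|x - mean| >= t*sigma} for a normal variable, via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))

for t in (1, 2, 3, 4):
    print(t, round(chebyshev_bound(t), 4), round(normal_two_sided_tail(t), 5))
```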


48. Example—Bernoullian Trials 
In Chap. 4 we saw that a Bernoullian sequence of trials was represented by a sequence of stochastic variables:

   x_i:      0    1
f(x_i):      q    p

For each i,

$$E(x_i) = 0 \cdot q + 1 \cdot p = p.$$

The number of successes in n trials we have denoted by r. We have seen that

$$r = \sum_{i=1}^{n} x_i.$$

Therefore, by Theorem II,

$$E(r) = \Sigma E(x_i) = np.$$

For each i, the variance of $x_i$ is

$$E(x_i^2) - \bar{x}_i^2 = (0 \cdot q + 1 \cdot p) - p^2 = p(1 - p) = pq.$$

The variables are independent; so Theorem V applies, and

$$\operatorname{var}(r) = \Sigma \operatorname{var}(x_i) = npq.$$
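These two results can be confirmed from the binomial probability function itself. The following sketch uses exact rational arithmetic; the values n = 12 and p = 1/3 are hypothetical and chosen only for illustration.

```python
import math
from fractions import Fraction

def binomial_pmf(n, p):
    """Probability function of r, the number of successes in n Bernoullian trials."""
    q = 1 - p
    return {r: math.comb(n, r) * p ** r * q ** (n - r) for r in range(n + 1)}

def mean_and_variance(dist):
    mean = sum(prob * r for r, prob in dist.items())
    var = sum(prob * (r - mean) ** 2 for r, prob in dist.items())
    return mean, var

n, p = 12, Fraction(1, 3)   # hypothetical values chosen only for illustration
mean, var = mean_and_variance(binomial_pmf(n, p))
print(mean, var)            # to be compared with np and npq
```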


The student should make a note of these results. They will be needed frequently in later chapters. As we pointed out in our discussion of extrasensory perception tests (see example on the matching problem earlier in this chapter), the expectation of the number of successes is np no matter how the trials are related. The thing that makes the study of Bernoullian trials particularly profitable is that for them we have such a simple expression for the variance of the number of successes.


49. Covariance and Correlation 

If x and y are stochastic variables, the variance of each of them is the expectation of the square of its deviation from its mean. The covariance of the two is defined as the expectation of the product of their deviations from their respective means. That is,

$$\operatorname{covar}(x,y) = E[(x - \bar{x})(y - \bar{y})].$$

Now, it follows immediately from the addition theorem (Theorem II) that

$$\operatorname{covar}(x,y) = E(xy - \bar{x}y - x\bar{y} + \bar{x}\bar{y}) = E(xy) - E(x)E(y).$$

If covar(x,y) = 0, we say that x and y are uncorrelated. The following theorem is now obvious.

Theorem VII. A necessary and sufficient condition that

$$E(xy) = E(x)E(y)$$

is that x and y be uncorrelated.

On comparing Theorems VII and III, we have the following:

Corollary 1. If x and y are independent, they are uncorrelated.


The converse to Corollary 1 is not true. Variables may be uncorrelated but badly dependent. For example, if x is a variable with a constant density function f(x) = 1/2 on −1 < x < 1 and if y = x², then

$$E(x) = \int_{-1}^{1} \tfrac{1}{2}x\,dx = 0;$$

so

$$E(x)E(y) = 0.$$

Furthermore,

$$E(xy) = \int_{-1}^{1} \tfrac{1}{2}x^3\,dx = 0;$$

thus x and y are uncorrelated. However, for each value of x, there is only one possible value for y; and for each value of y, there are only two possible values for x. Therefore, x and y are far from independent.

The significance of uncorrelated variables in probability theory is that in many instances we want to be able to write E(x)E(y) for E(xy). In these instances it is sufficient to postulate that x and y are uncorrelated. Theorems depending on this maneuver will obviously hold if x and y are independent, but it is probably a good idea to use the weaker hypothesis when possible. This not only gives a stronger theorem but frequently makes the hypotheses easier to verify in specific instances.
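The example of a variable that is uniform on (−1, 1) together with its square can be checked numerically. The sketch below is an illustration only; it approximates the three expectations by a midpoint rule and forms the covariance E(xy) − E(x)E(y).

```python
def midpoint(f, a, b, steps=100_000):
    """Composite midpoint rule on (a, b)."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

density = 0.5                                  # constant density on (-1, 1)
E_x = midpoint(lambda x: density * x, -1.0, 1.0)
E_y = midpoint(lambda x: density * x ** 2, -1.0, 1.0)    # y = x^2
E_xy = midpoint(lambda x: density * x ** 3, -1.0, 1.0)   # xy = x^3
covar = E_xy - E_x * E_y

print(E_x, E_y, E_xy, covar)
```

The covariance vanishes up to rounding error even though y is completely determined by x.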
A further pertinent observation is that if x, y, and z are uncorrelated by pairs, i.e., if

$$\operatorname{covar}(x,y) = \operatorname{covar}(x,z) = \operatorname{covar}(y,z) = 0,$$

it does not follow that

$$E(xyz) = E(x)E(y)E(z).$$

In Chap. 3 we made an analogous observation with regard to independence of stochastic variables, and the example given there demonstrates our present point too. In that example the variables are independent by pairs, therefore uncorrelated by pairs; yet

E(xyz) = 14 


while 

A) EY)E@) = 14x14 x 4 = i. 
In the proof of Theorem V we used the hypothesis of independence only to justify the use of the multiplication theorem. Furthermore, we applied the multiplication theorem to only two variables at a time. Therefore, it follows from Theorem VII that in this theorem, "uncorrelated by pairs" will suffice in place of "independent." For future reference, let us restate the theorem in this stronger form.

Corollary 2. If $x_1, x_2, \ldots, x_n$ are stochastic variables which are uncorrelated by pairs, then

$$\operatorname{var}(\Sigma x_i) = \Sigma \operatorname{var}(x_i).$$
The statistician uses covariance in connection with the coefficient of linear correlation of two stochastic variables:

$$r = \frac{\operatorname{covar}(x,y)}{[\operatorname{var}(x)\operatorname{var}(y)]^{1/2}}.$$

In case $x - \bar{x}$ and $y - \bar{y}$ are linearly dependent, i.e., if

$$y - \bar{y} = k(x - \bar{x}),$$

then, letting $u = x - \bar{x}$ and $v = y - \bar{y}$, we have

$$r = \frac{E(ku^2)}{[E(u^2)E(k^2u^2)]^{1/2}} = \frac{kE(u^2)}{|k|E(u^2)} = \pm 1,$$

the sign of r being the same as that of k. The converse is also true; i.e., if |r| = 1, then $y - \bar{y} = k(x - \bar{x})$. We shall not go into the proof of this, but we should show that in all cases |r| ≤ 1.


Theorem VIII (Schwarz's Inequality). For any stochastic variables x and y,

$$[E(xy)]^2 \le E(x^2)E(y^2).$$

For every real constant a, the variable (ax − y)² is always nonnegative and therefore has a nonnegative expectation. That is,

$$a^2E(x^2) - 2aE(xy) + E(y^2) \ge 0$$

for every a. In particular, this holds for

$$a = \frac{E(xy)}{E(x^2)};$$

and on substituting this value for a, we have

$$\frac{[E(xy)]^2}{E(x^2)} - \frac{2[E(xy)]^2}{E(x^2)} + E(y^2) \ge 0.$$

The result now follows if we multiply by E(x²) and transpose.

To apply Theorem VIII to the correlation coefficient, all we have to do is to translate the origin of each variable to its mean; then we substitute into Theorem VIII, divide by E(x²)E(y²), and take square roots. The resulting inequality is just exactly |r| ≤ 1.
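Since a finite list of equally likely pairs is itself a discrete distribution, the bound |r| ≤ 1 and the extreme values ±1 for linearly dependent variables can be seen on simulated data. The sketch below is illustrative only; the sample size, the seed, and the particular linear relations are assumptions of ours.

```python
import math
import random

def correlation(pairs):
    """Coefficient of linear correlation for equally likely (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    covar = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
    return covar / math.sqrt(var_x * var_y)

rng = random.Random(0)                      # fixed seed so the run is repeatable
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
noisy = [(x, 0.5 * x + rng.gauss(0.0, 0.3)) for x in xs]
linear_up = [(x, 3.0 * x + 2.0) for x in xs]      # y - mean(y) = k (x - mean(x)) with k > 0
linear_down = [(x, -3.0 * x + 2.0) for x in xs]   # the same with k < 0

print(correlation(noisy), correlation(linear_up), correlation(linear_down))
```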


50. Normal Correlation

There is one very important case in which the coefficient of linear correlation gives considerably more information than we have suggested so far. If x and y have the joint density function postulated in Theorem IX below, they are said to be in normal correlation. Such variables appear in many different connections, and we want to develop here a few of their more important properties.

Theorem IX. If x and y have the joint density function

$$\varphi(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} \exp\left[-\frac{1}{2(1 - r^2)}\left(\frac{x^2}{\sigma_1^2} - \frac{2rxy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2}\right)\right],$$

where |r| < 1, then each of the following statements holds:

(a) x is normally distributed with variance $\sigma_1^2$, and y is normally distributed with variance $\sigma_2^2$.

(b) There exist independent normally distributed variables u and v such that each of the variables x and y is a linear combination of u and v.

(c) The constant r in the above joint density function is the correlation coefficient of x and y.

(d) x and y are independent if and only if they are uncorrelated.
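The proof of (a) consists in finding the marginal density functions. If we complete the square in the exponent of φ(x,y), we get for an exponent

$$-\frac{1}{2(1 - r^2)}\left[\left(\frac{y}{\sigma_2} - \frac{rx}{\sigma_1}\right)^2 + (1 - r^2)\frac{x^2}{\sigma_1^2}\right];$$

therefore

$$f(x) = \int_{-\infty}^{\infty} \varphi(x,y)\,dy = \frac{1}{\sigma_1\sqrt{2\pi}}\,e^{-x^2/2\sigma_1^2} \cdot \frac{1}{\sigma_2\sqrt{2\pi(1 - r^2)}} \int_{-\infty}^{\infty} \exp\left[-\frac{1}{2(1 - r^2)}\left(\frac{y}{\sigma_2} - \frac{rx}{\sigma_1}\right)^2\right] dy = \frac{1}{\sigma_1\sqrt{2\pi}}\,e^{-x^2/2\sigma_1^2}.$$

Similarly,

$$g(y) = \frac{1}{\sigma_2\sqrt{2\pi}}\,e^{-y^2/2\sigma_2^2},$$

and (a) is proved.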
Actually, (b) follows immediately from considerations of analytic geometry. Independent normal variables u and v are characterized by a joint density function of the form

$$K e^{-(a^2u^2 + b^2v^2)}.$$

Now, a rotation of the axes always gives the old variables as linear combinations of the new, and there is always a rotation that will remove the cross-product term. Furthermore, the condition |r| < 1 makes the discriminant of the exponent negative:

$$\left(\frac{-2r}{\sigma_1\sigma_2}\right)^2 - \frac{4}{\sigma_1^2\sigma_2^2} < 0;$$

so the rotation always leads to a sum of squares (equation of an ellipse). Finally, a rigid transformation like a rotation carries the area differential dx dy into du dv; so we get a new joint density function of the desired form, and (b) is proved.
In order to prove (c) we need to look a little more closely at the details of this rotation transformation. If α is the angle of rotation, we have the relations (found in any analytic geometry book):

$$(1)\qquad x = u\cos\alpha - v\sin\alpha, \qquad y = u\sin\alpha + v\cos\alpha.$$

$$(2)\qquad u = x\cos\alpha + y\sin\alpha, \qquad v = -x\sin\alpha + y\cos\alpha.$$

Furthermore, the rotation which removes the xy term in the exponent of φ(x,y) is characterized by the condition

$$(3)\qquad \tan 2\alpha = \frac{2r\sigma_1\sigma_2}{\sigma_1^2 - \sigma_2^2}.$$

From (1) we find that

$$xy = (u^2 - v^2)\sin\alpha\cos\alpha + uv(\cos^2\alpha - \sin^2\alpha) = \tfrac{1}{2}(u^2 - v^2)\sin 2\alpha + uv\cos 2\alpha.$$

Now u and v are independent and normal; so E(uv) = 0. Therefore,

$$E(xy) = \tfrac{1}{2}E(u^2 - v^2)\sin 2\alpha.$$

From (2) we get

$$u^2 - v^2 = (x^2 - y^2)(\cos^2\alpha - \sin^2\alpha) + 4xy\sin\alpha\cos\alpha = (x^2 - y^2)\cos 2\alpha + 2xy\sin 2\alpha;$$

so

$$E(xy) = \tfrac{1}{2}E(x^2 - y^2)\sin 2\alpha\cos 2\alpha + E(xy)\sin^2 2\alpha.$$

Transposing, we have

$$2E(xy)(1 - \sin^2 2\alpha) = E(x^2 - y^2)\sin 2\alpha\cos 2\alpha;$$

that is,

$$2E(xy)\cos^2 2\alpha = [E(x^2) - E(y^2)]\sin 2\alpha\cos 2\alpha = (\sigma_1^2 - \sigma_2^2)\sin 2\alpha\cos 2\alpha.$$

Therefore,

$$E(xy) = \tfrac{1}{2}(\sigma_1^2 - \sigma_2^2)\tan 2\alpha;$$

so by (3)

$$E(xy) = r\sigma_1\sigma_2.$$

This proves (c).
At this stage, (d) is obvious. The joint density function φ(x,y) reduces to the form for independent variables if and only if r = 0.

One final remark of interest about variables in normal correlation is that the loci

$$\frac{x^2}{\sigma_1^2} - \frac{2rxy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2} = \text{constant}$$

are called (for obvious reasons) the ellipses of equal probability. Equation (3) shows how the orientation of the axes of these ellipses can be determined from the standard deviations $\sigma_1$ and $\sigma_2$ and the correlation coefficient r.
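The role of the rotation angle can be illustrated with second moments alone. The sketch below assumes E(x²) = σ₁², E(y²) = σ₂², E(xy) = rσ₁σ₂ and a rotation u = x cos α + y sin α, v = −x sin α + y cos α with tan 2α = 2rσ₁σ₂/(σ₁² − σ₂²); it then checks that E(uv) vanishes. The numerical parameters and function names are hypothetical, and `atan2` is used only to pick one of the angles satisfying the tangent condition.

```python
import math

def rotation_angle(sigma1, sigma2, r):
    """An angle alpha with tan(2 alpha) = 2 r sigma1 sigma2 / (sigma1^2 - sigma2^2)."""
    return 0.5 * math.atan2(2.0 * r * sigma1 * sigma2, sigma1 ** 2 - sigma2 ** 2)

def rotated_moments(sigma1, sigma2, r, alpha):
    """E(u^2), E(v^2), E(uv) for u = x cos a + y sin a, v = -x sin a + y cos a."""
    c, s = math.cos(alpha), math.sin(alpha)
    exx, eyy, exy = sigma1 ** 2, sigma2 ** 2, r * sigma1 * sigma2
    euu = exx * c * c + 2 * exy * c * s + eyy * s * s
    evv = exx * s * s - 2 * exy * c * s + eyy * c * c
    euv = (eyy - exx) * c * s + exy * (c * c - s * s)
    return euu, evv, euv

sigma1, sigma2, r = 2.0, 1.0, 0.6      # hypothetical parameters for illustration
alpha = rotation_angle(sigma1, sigma2, r)
euu, evv, euv = rotated_moments(sigma1, sigma2, r, alpha)
print(math.degrees(alpha), euu, evv, euv)
```

With this choice of angle, half the difference E(u²) − E(v²) times sin 2α reproduces rσ₁σ₂, in agreement with the relation used in the proof of (c).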


REFERENCES FOR FURTHER STUDY 
Coolidge, An Introduction to Mathematical Probability, Chap. VIII. 
Levy and Roth, Elements of Probability, Chap. VIII.
Uspensky, Introduction to Mathematical Probability, Chaps. IX, XV- 


PROBLEMS 
1. Five dice are tossed together, the experiment being repeated 2… times. What is the expectation of the number of times exactly 3 a… appear? Ans. …

2. The face cards are discarded from 2 decks of cards; then 1 card is drawn from each deck. What is the expectation of the sum of the numbers drawn? Ans. 11.

3. Two cards are drawn simultaneously from 1 of the decks in Prob. 2. What is the expectation of the sum? Ans. 11.

4. Thirteen cards are drawn simultaneously from a deck of 52. If aces count 1, face cards 10, and others according to denomination, find the expectation of the total score on the 13 cards. Ans. 85.
5. There are 1,000 tickets in a certain lottery 1,000. 
If 30 different winners are picked, what i j 
the winning numbers? 


numbered 1 to 


2 Ans. 350/12. 
8. If x is chosen at random in the interval (a,b), what is E(x)? var(x)? Ans. (a + b)/2, (b − a)²/12.

9. Let x and y be chosen independently and at random between 0 and 1. Find the expectation and variance of x + y. Ans. 1, 1/6.

10. If x is normally distributed with variance unity, what is the expectation of x²? the variance of x²? Ans. 1, 2.

11. If $x_1, x_2, \ldots, x_n$ are independent and if each is normally distributed with unit variance, what is the expectation of $\Sigma x_i^2$? the variance of $\Sigma x_i^2$? Ans. n, 2n.

12. A Poisson sequence of trials is a sequence of independent trials in which (unlike the Bernoullian case) the probability of success varies from trial to trial. Let the probability of success on the ith trial be $p_i$ (failure, $q_i$), and let p be the mean probability in n trials:

$$p = \frac{1}{n}\sum_{i=1}^{n} p_i.$$

Show that the expectation of the number of successes in n trials is np and the variance of the number of successes is

$$\sum_{i=1}^{n} p_i q_i.$$

13. Two dice are thrown until a 7 is thrown. Find the most probable number of throws and the expectation of the number of throws. Ans. 1, 6.

14. Let x have a Poisson distribution; that is, x = 0, 1, 2, 3, . . . with probability function $f(x) = a^x e^{-a}/x!$. Show that

$$E(x^2) = aE(x + 1).$$

15. Find the expectation and variance of a variable with the Poisson distribution. Ans. a, a.


16. sz, Petersburg Problem. A player tosses a coin. Tf his first toss 


is} f : ; 
Oe ia he wins a dollar. If his second toss is heads, he wins another 
ar, 


Wi 


= °xpectation of the amount paid him in a single game. Ans. ©. 
17. Since the game in Prob. 16 seems impractical, suppose that, once his winnings reach $1,000,000 by the rules of Prob. 16, the player receives $1,000,000 for each additional heads instead of having his winnings doubled each time. What is the expectation of the amount won under these rules? Ans. 10.96.


18. Let x and y be independent and normally distributed, each with variance unity. Find the expectation and variance of $\sqrt{x^2 + y^2}$. Ans. $\sqrt{\pi/2}$, 2 − (π/2).

19. Let x be normally distributed with variance unity. Find the expectation of |x|. Ans. $\sqrt{2/\pi}$.

20. Let x be normally distributed with variance unity. Find the expectation and variance of $e^x$. Ans. $\sqrt{e}$, e² − e.

21. A variable x with the density function

$$f(x) = \frac{1}{x\sqrt{2\pi}}\,e^{-(\log x)^2/2} \qquad (x > 0)$$

is said to have a logarithmiconormal distribution. Find the expectation and variance of this variable. Hint: Compare Prob. 20 above with Prob. 24, Chap. 5. Ans. $\sqrt{e}$, e² − e.

22. Let x have a density function $f(x) = \tfrac{1}{2}e^{-|x|}$. Find the expectation and variance of x. Ans. 0, 2.

23. Pareto's distribution is defined by the density function

$$f(x) = \begin{cases} \dfrac{\alpha}{x_0}\left(\dfrac{x_0}{x}\right)^{\alpha+1} & \text{for } x > x_0, \\ 0 & \text{for } x < x_0. \end{cases}$$

Show that a variable with this distribution has a moment of order k if and only if α > k. In particular, if α > 1, show that

$$E(x) = \alpha x_0/(\alpha - 1).$$

24. Generalize (a) of Theorem IV. If a is any constant,

$$\operatorname{var}(x) = E[(x - a)^2] - [E(x - a)]^2.$$

25. Show from Prob. 24 that the mean of a variable is the point about which its mean square deviation is a minimum.

26. Use the method of proof of Theorem VI to show that, if x is a stochastic variable whose density function is zero for x < 0, then, for t > 0, $\operatorname{pr}\{x > t\} \le \bar{x}/t$.

27. Show that Theorem VI is a special case of the theorem in Prob. 26.

28. Show that for any pair of stochastic variables

$$\operatorname{var}(x + y) = \operatorname{var}(x) + \operatorname{var}(y) + 2\operatorname{covar}(x,y).$$

29. Generalize Prob. 28. If $x_1, x_2, \ldots, x_n$ is any set of n stochastic variables,

$$\operatorname{var}\left(\sum_i x_i\right) = \sum_i \operatorname{var}(x_i) + \sum_{i \ne j} \operatorname{covar}(x_i, x_j).$$

30. Find the coefficient of correlation of the variables in Prob. 31, Chap. 3. Ans. 0.

31. If the joint density function for x and y is φ(x,y) = x + y (0 < x < 1, 0 < y < 1) with φ = 0 otherwise, what is the correlation coefficient for x and y? Ans. −1/11.

32. Let x and y be chosen independently and at random between 0 and 1. Find the correlation coefficient for u = max(x,y) and v = min(x,y). Hint: See Probs. 11 to 13, Chap. 5. Ans. 1/2.

33. Show that, if x is chosen at random between −π and π, sin mx and sin nx are uncorrelated when m ≠ n.

34. Show that, if x and y are in normal correlation, then ax and by are in normal correlation.

35. For x and y in normal correlation and independent, what is the orientation of the ellipse of equal probability?

36. Experience shows that in artillery fire the pattern of hits forms an approximately normal distribution with the major axis of the ellipse of equal probability on the line from gun to target. Given a record of a large number of hits from the same gun in a given area, show how Equation (3), Sec. 50, can be used to determine approximately the direction from there to the gun.


CHAPTER 7 
SPECIAL TOPICS IN CALCULUS 


There are a number of special formulas and identities which are not usually included in a first course in calculus but which are very useful tools in many computations in probability theory. It is the purpose of this chapter to summarize these. In later portions of this book we shall make a number of references to the formulas developed here, but we shall not begin to indicate all the uses that can be made of this material. The student who is interested in further work in probability and statistics would do well to make the information presented in this chapter a part of his general repertory.


51. The Beta and Gamma Functions 


The beta and gamma functions are sometimes called Eulerian integrals of the first and second kind, respectively. They are defined as follows:

$$(1)\qquad B(x,y) = \int_0^1 t^{x-1}(1 - t)^{y-1}\,dt;$$

$$(2)\qquad \Gamma(x) = \int_0^{\infty} t^{x-1}e^{-t}\,dt.$$

A little knowledge of improper integrals tells us immediately that B(x,y) is defined if x and y are both greater than zero and that Γ(x) is defined if x > 0. In what follows, we shall understand that, whenever these functions appear, the arguments used satisfy these conditions.


On integrating (2) by parts, we get

$$\int_0^{\infty} t^{x-1}e^{-t}\,dt = \left[-t^{x-1}e^{-t}\right]_0^{\infty} + (x - 1)\int_0^{\infty} t^{x-2}e^{-t}\,dt,$$

which gives us the important recursion formula:

$$(3)\qquad \Gamma(x) = (x - 1)\Gamma(x - 1).$$

So, starting with the special case

$$(4)\qquad \Gamma(1) = \int_0^{\infty} e^{-t}\,dt = 1,$$

we can apply (3) over and over again to find that Γ(2) = 1, Γ(3) = 2, Γ(4) = 6, and in general

$$(5)\qquad \Gamma(n + 1) = n!.$$

This relation should suggest something of the importance of the gamma function.

In the integral (2) let us make the change of variable t = av where a is a constant:

$$\Gamma(x) = \int_0^{\infty} a^{x-1}v^{x-1}e^{-av}a\,dv.$$

Factoring out $a^x$, we have another useful formula:

$$(6)\qquad \int_0^{\infty} v^{x-1}e^{-av}\,dv = \frac{\Gamma(x)}{a^x}.$$
Turning now to the beta integral (1), let us make the change of variable t = 1 − s. This gives us the important relation

$$(7)\qquad B(x,y) = \int_1^0 (1 - s)^{x-1}s^{y-1}(-ds) = \int_0^1 s^{y-1}(1 - s)^{x-1}\,ds = B(y,x).$$

Returning to (1), we make the substitutions

$$t = \frac{1}{1 + u}, \qquad dt = -\frac{du}{(1 + u)^2}.$$

This gives

$$(8)\qquad B(x,y) = \int_0^{\infty} \frac{u^{y-1}}{(1 + u)^{x+y}}\,du.$$

From (6) we have

$$\int_0^{\infty} v^{x+y-1}e^{-(1+u)v}\,dv = \frac{\Gamma(x + y)}{(1 + u)^{x+y}}.$$

If we multiply each side by $u^{x-1}\,du$ and integrate with respect to u from 0 to ∞, we have

$$\int_0^{\infty} v^{x+y-1}e^{-v}\,dv \int_0^{\infty} u^{x-1}e^{-uv}\,du = \Gamma(x + y)\int_0^{\infty} \frac{u^{x-1}}{(1 + u)^{x+y}}\,du.$$

Applying (6) to the second integral on the left and (8) to the integral on the right, we get

$$\Gamma(x)\int_0^{\infty} v^{y-1}e^{-v}\,dv = \Gamma(x + y)B(y,x).$$

So, from (2) and (7), we get the important relation between B and Γ:

$$(9)\qquad B(x,y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)}.$$

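If, in (9), we replace B(x,y) by its integral expression (1) and make the change of variable

$$t = \cos^2\theta, \qquad dt = -2\cos\theta\sin\theta\,d\theta,$$

we have

$$\frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)} = 2\int_0^{\pi/2} (\cos\theta)^{2x-1}(\sin\theta)^{2y-1}\,d\theta;$$

hence, setting x = y = 1/2 and noting (4), we have

$$[\Gamma(\tfrac{1}{2})]^2 = 2\int_0^{\pi/2} d\theta = \pi.$$

Thus,

$$(10)\qquad \Gamma(\tfrac{1}{2}) = \sqrt{\pi}.$$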


On writing the integral (1) for B(x,x):

$$B(x,x) = \int_0^1 (t - t^2)^{x-1}\,dt,$$

we see that the integrand is symmetric about t = 1/2; so we can write

$$B(x,x) = 2\int_0^{1/2} (t - t^2)^{x-1}\,dt.$$

The change of variable

$$t = \tfrac{1}{2}(1 - \sqrt{s}), \qquad dt = -\tfrac{1}{4}s^{-1/2}\,ds$$

then gives

$$B(x,x) = 2^{1-2x}\int_0^1 (1 - s)^{x-1}s^{-1/2}\,ds = 2^{1-2x}B(x,\tfrac{1}{2}).$$

If, in this equation, we transform the beta functions into gamma functions by means of (9), we have the formula of Legendre:

$$(11)\qquad \Gamma(2x)\Gamma(\tfrac{1}{2}) = 2^{2x-1}\Gamma(x)\Gamma(x + \tfrac{1}{2}).$$
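Several of the relations of this section can be spot-checked numerically. The sketch below relies on the standard-library function `math.gamma`; the helper names `beta` and `beta_midpoint`, and the sample arguments, are our own choices for illustration.

```python
import math

def beta(x, y):
    """B(x, y) computed from gamma functions, as in the relation between B and the gamma function."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

def beta_midpoint(x, y, steps=20_000):
    """Direct midpoint-rule estimate of the beta integral; adequate when x, y >= 1."""
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** (x - 1) * (1 - (i + 0.5) * h) ** (y - 1)
                   for i in range(steps))

print(math.gamma(5), math.factorial(4))              # gamma(n + 1) against n!
print(math.gamma(0.5), math.sqrt(math.pi))           # gamma(1/2) against sqrt(pi)
print(beta(2.5, 3.0), beta_midpoint(2.5, 3.0))       # gamma-function form against the integral
x = 1.7
print(math.gamma(2 * x) * math.gamma(0.5),
      2 ** (2 * x - 1) * math.gamma(x) * math.gamma(x + 0.5))   # Legendre's formula
```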
52. The O, o Notation


A very useful tool in analysis is the so-called "big O" and "little o" notation. If y = y(x) and some limit operation for x is indicated (say, x → ∞), then o(y) is used to denote a function z(x) such that

$$\lim \frac{z}{y} = 0.$$

O(y) is used to denote a function z(x) such that z/y is bounded for x sufficiently large. These are generic notations and do not denote specific functions. This makes them very useful for describing remainder terms whose specific values we do not care about but whose asymptotic behavior is important.

In particular, o(1) means a function which tends to zero. Several special properties of o(1) functions should be pointed out. First, the sum or difference of o(1)'s is o(1). For exponentials and logarithms we have the relations

$$\log[1 + o(1)] = o(1);$$
$$e^{o(1)} = 1 + o(1).$$

The relation f(x) ~ g(x) is an important one in analysis. It is read "f(x) is asymptotic to g(x)" and means, by definition,

$$\lim \frac{f(x)}{g(x)} = 1.$$

Clearly, this may be written

$$f(x) = g(x)[1 + o(1)],$$

or

$$\log f = \log g + o(1).$$
We shall frequently be interested in the asymptotic behavior of expressions of the form (x + a) log(x + b) as x → ∞. It will simplify matters if we figure this out and record the result now. From Maclaurin's formula with remainder, we find that

$$\log(1 + z) = z + O(z^2)$$

as z → 0; therefore,

$$(x + a)\log(x + b) = (x + a)\log\left[x\left(1 + \frac{b}{x}\right)\right] = (x + a)\log x + (x + a)\log\left(1 + \frac{b}{x}\right)$$
$$= (x + a)\log x + (x + a)\left[\frac{b}{x} + O(x^{-2})\right] = (x + a)\log x + b + O(x^{-1}).$$

So let us list for reference:

$$(1)\qquad (x + a)\log(x + b) = x\log x + a\log x + b + o(1).$$

The "big O" relationship is the thing involved in the comparison test for convergence of infinite series. Thus, one of our chief uses of it will be to note that, if p > 1,

$$\sum_{n=1}^{\infty} O(n^{-p})$$

is convergent.
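The asymptotic relation for (x + a) log(x + b) can be watched numerically: the difference between the two sides should shrink as x grows. In the sketch below the constants a = 2 and b = 3 are hypothetical, and the function name `remainder` is ours.

```python
import math

def remainder(x, a, b):
    """Difference between (x + a) log(x + b) and x log x + a log x + b."""
    return (x + a) * math.log(x + b) - (x * math.log(x) + a * math.log(x) + b)

a, b = 2.0, 3.0        # hypothetical constants chosen for illustration
for x in (1e1, 1e2, 1e3, 1e4):
    print(x, remainder(x, a, b))
```

For these constants the remainder behaves roughly like (ab − b²/2)/x, which is one concrete instance of an o(1) term.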


53. Stirling’s Theorem 


There is a famous asymptotic formula for n! which goes under the name of Stirling's theorem, though it appears that a lot of the credit for it should go to de Moivre.

Theorem I.

$$n! = n^n e^{-n}\sqrt{2\pi n}\;e^{\epsilon_n},$$

where

$$(1)\qquad \frac{1}{12n + 6} < \epsilon_n < \frac{1}{12n}.$$

We shall prove this formula in its logarithmic form:

$$(2)\qquad \log(n!) = (n + \tfrac{1}{2})\log n - n + \tfrac{1}{2}\log(2\pi) + \epsilon_n.$$

Now,

$$\log(n!) = \sum_{k=1}^{n} \log k.$$
2 
If we draw a graph of log x, we get a picture of this sum as the sum of areas of rectangles:

Fig. 24.

From this it appears that $\sum_{k=1}^{n} \log k$ is approximately equal to

$$\int_1^{n+1} \log x\,dx.$$

However, we see that there is a closer approximation if we shift the rectangles 1/2 unit to the left:

Fig. 25.

We shall use the approximation suggested by this figure, that is,

$$\log(n!) \approx \int_{1/2}^{n+1/2} \log x\,dx.$$


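First, however, let us find out how good an approximation it is. To do this, we note that

$$\int_{k-1/2}^{k+1/2} \log t\,dt = \int_k^{k+1/2} \log t\,dt + \int_{k-1/2}^{k} \log t\,dt.$$

In the first integral we make the change of variable

$$t = k + x, \qquad dt = dx;$$

in the second,

$$t = k - x, \qquad dt = -dx.$$

Thus

$$\int_{k-1/2}^{k+1/2} \log t\,dt = \int_0^{1/2} \log(k + x)\,dx + \int_0^{1/2} \log(k - x)\,dx = \int_0^{1/2} \log(k^2 - x^2)\,dx = \log k + \int_0^{1/2} \log\left(1 - \frac{x^2}{k^2}\right)dx.$$

Now, the series expansion

$$\log\left(1 - \frac{x^2}{k^2}\right) = -\frac{x^2}{k^2} - \frac{x^4}{2k^4} - \frac{x^6}{3k^6} - \cdots$$

is convergent for x²/k² < 1; therefore we may integrate from 0 to 1/2, term by term, and get

$$\int_0^{1/2} \log\left(1 - \frac{x^2}{k^2}\right)dx = -\left[\frac{x^3}{3k^2} + \frac{x^5}{10k^4} + \cdots\right]_0^{1/2} = -\frac{1}{k^2}\left(\frac{1}{24} + \frac{1}{320k^2} + \cdots\right) = O(k^{-2}).$$

Therefore,

$$\int_{k-1/2}^{k+1/2} \log t\,dt = \log k + O(k^{-2}),$$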
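and

$$\log(n!) = \sum_{k=1}^{n} \log k = \sum_{k=1}^{n}\left[\int_{k-1/2}^{k+1/2} \log t\,dt + O(k^{-2})\right] = \int_{1/2}^{n+1/2} \log t\,dt + \sum_{k=1}^{n} O(k^{-2})$$
$$= \Big[t\log t - t\Big]_{1/2}^{n+1/2} + \sum_{k=1}^{\infty} O(k^{-2}) + o(1) = (n + \tfrac{1}{2})\log(n + \tfrac{1}{2}) - n + C_1 + o(1),$$

where the constant $C_1$ is $-\tfrac{1}{2}\log(\tfrac{1}{2}) + \sum_{k=1}^{\infty} O(k^{-2})$. On applying (1), Sec. 52, we get an additional constant term of 1/2; so letting $C_2 = C_1 + \tfrac{1}{2}$, we have

$$(3)\qquad \log(n!) = (n + \tfrac{1}{2})\log n - n + C_2 + o(1).$$

Comparison of (3) and (2) shows that we have yet to prove two things: first that $C_2 = \tfrac{1}{2}\log(2\pi)$, and second that the error term, so far merely o(1), really satisfies Eq. (1). There are several proofs in circulation that the constant in Stirling's formula has this value, but they are all rather long. In Chap. 9 we shall meet an application of Stirling's theorem in which it will be obvious from other considerations that this must be the case. So, for the present, let us assume this value of the constant and write

$$\alpha_n = \epsilon_n - \frac{1}{12n}, \qquad \beta_n = \epsilon_n - \frac{1}{12n + 6}.$$

Now, (1) reads

$$(4)\qquad \alpha_n < 0 < \beta_n.$$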


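To show these inequalities, we note that (3) tells us that $\epsilon_n \to 0$; therefore $\alpha_n \to 0$ and $\beta_n \to 0$, and (4) follows provided that $\alpha_n$ is steadily increasing and $\beta_n$ is steadily decreasing. So we shall complete the proof by showing that

$$(5)\qquad \alpha_{n+1} - \alpha_n > 0$$

and

$$(6)\qquad \beta_{n+1} - \beta_n < 0.$$

First, let us look at

$$\epsilon_{n+1} - \epsilon_n = \log\frac{(n + 1)!}{(n + 1)^{n+3/2}e^{-n-1}\sqrt{2\pi}} - \log\frac{n!}{n^{n+1/2}e^{-n}\sqrt{2\pi}} = 1 + (n + \tfrac{1}{2})\log\frac{n}{n + 1}.$$

If we set z = n + 1/2, we have

$$\epsilon_{n+1} - \epsilon_n = 1 + z\log\frac{z - \tfrac{1}{2}}{z + \tfrac{1}{2}} = 1 + z\log\frac{1 - (1/2z)}{1 + (1/2z)}$$
$$= 1 + z\left[\left(-\frac{1}{2z} - \frac{1}{8z^2} - \frac{1}{24z^3} - \cdots\right) - \left(\frac{1}{2z} - \frac{1}{8z^2} + \frac{1}{24z^3} - \cdots\right)\right]$$
$$= 1 + z\left(-\frac{1}{z} - \frac{1}{12z^3} - \frac{1}{80z^5} - \cdots\right) = -\frac{1}{12z^2} - \frac{1}{80z^4} - \cdots = -\sum_{k=1}^{\infty} \frac{1}{(2k + 1)(2z)^{2k}}.$$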

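Now,

$$\alpha_{n+1} - \alpha_n = \epsilon_{n+1} - \epsilon_n - \frac{1}{12(n + 1)} + \frac{1}{12n} = -\sum_{k=1}^{\infty} \frac{1}{(2k + 1)(2z)^{2k}} + \frac{1}{12(z - \tfrac{1}{2})(z + \tfrac{1}{2})}$$
$$= -\sum_{k=1}^{\infty} \frac{1}{(2k + 1)(2z)^{2k}} + \frac{1}{12z^2}\sum_{k=0}^{\infty} \frac{1}{(4z^2)^k} = -\sum_{k=1}^{\infty} \frac{1}{(2k + 1)(2z)^{2k}} + \sum_{k=1}^{\infty} \frac{1}{3(2z)^{2k}}$$
$$= \sum_{k=1}^{\infty} \frac{2k - 2}{3(2k + 1)(2z)^{2k}} > 0.$$

This proves (5). Inequality (6) is much easier:

$$\beta_{n+1} - \beta_n = \epsilon_{n+1} - \epsilon_n - \frac{1}{12n + 18} + \frac{1}{12n + 6} < -\frac{1}{12z^2} - \frac{1}{12(z + 1)} + \frac{1}{12z} = -\frac{1}{12z^2} + \frac{1}{12z(z + 1)} < 0.$$

This completes the proof of Stirling's theorem.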
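The bounds on the error term can be observed numerically. The sketch below computes log(n!) with the standard-library function `math.lgamma` and forms the difference between it and (n + 1/2) log n − n + (1/2) log(2π); the function names are ours, and the range of n tested is an arbitrary choice.

```python
import math

def stirling_epsilon(n):
    """log(n!) minus [(n + 1/2) log n - n + (1/2) log(2 pi)]."""
    return math.lgamma(n + 1) - ((n + 0.5) * math.log(n) - n + 0.5 * math.log(2.0 * math.pi))

def stirling_approximation(n):
    """n^n e^{-n} sqrt(2 pi n), i.e. the formula with the correction factor dropped."""
    return n ** n * math.exp(-n) * math.sqrt(2.0 * math.pi * n)

for n in (1, 2, 5, 10):
    eps = stirling_epsilon(n)
    inside = 1 / (12 * n + 6) < eps < 1 / (12 * n)
    print(n, stirling_approximation(n), math.factorial(n), eps, inside)
```

The absolute error of the approximation grows with n while the relative error, which is governed by the correction term, shrinks.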


54. Complex Exponentials 


The function $e^{iy}$, where i is the imaginary unit (i² = −1), is defined by the series

$$(1)\qquad e^{iy} = 1 + iy + \frac{(iy)^2}{2!} + \frac{(iy)^3}{3!} + \cdots = 1 + iy - \frac{y^2}{2!} - \frac{iy^3}{3!} + \cdots.$$

There are a number of properties of this function that we want to list here for use later. If we take the familiar Maclaurin expansions

$$\cos y = 1 - \frac{y^2}{2!} + \frac{y^4}{4!} - \cdots,$$
$$\sin y = y - \frac{y^3}{3!} + \frac{y^5}{5!} - \cdots,$$

multiply the second one by i, add, and compare (1), we have the Euler formula:

$$(2)\qquad e^{iy} = \cos y + i\sin y.$$
Replacing y by −y in (2), we have

$$e^{-iy} = \cos(-y) + i\sin(-y) = \cos y - i\sin y;$$

and, subtracting this result from (2), we have

$$(3)\qquad \sin y = \frac{1}{2i}(e^{iy} - e^{-iy}).$$

If we plot the complex number $e^{iy}$ in the complex plane, (2) tells us that its coordinates are cos y and sin y. Hence, its absolute value (distance from the origin) is

$$(\cos^2 y + \sin^2 y)^{1/2} = 1.$$

So we have the important fact that

$$(4)\qquad |e^{iy}| = 1$$

for all real y's. In particular, this tells us that the exponential of a pure imaginary is uniformly bounded.

Another important consequence of (2) is the periodicity of $e^{iy}$. Since

$$\cos(y + 2k\pi) = \cos y$$

and

$$\sin(y + 2k\pi) = \sin y,$$

it follows that

$$(5)\qquad e^{i(y + 2k\pi)} = e^{iy}$$

for any integer k. From this we obtain the interesting information that

$$\int_{-\pi}^{\pi} e^{iky}\,dy = \begin{cases} 0 & \text{if } k \ne 0, \\ 2\pi & \text{if } k = 0. \end{cases}$$

To prove this, we note that, if k = 0, $e^{iky} = 1$ and

$$\int_{-\pi}^{\pi} dy = 2\pi.$$

If k ≠ 0,

$$\int_{-\pi}^{\pi} e^{iky}\,dy = \frac{1}{ik}\Big[e^{iky}\Big]_{-\pi}^{\pi} = \frac{1}{ik}\left[e^{ik\pi} - e^{-ik\pi}\right] = 0$$

because of (5).
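The Euler formula, the unit absolute value, and the integral over a full period can all be spot-checked with the standard `cmath` module. The sketch below is illustrative; the test value y = 0.7, the function name, and the midpoint-rule integration are our own choices.

```python
import cmath
import math

def integral_exp_iky(k, steps=10_000):
    """Midpoint-rule estimate of the integral of e^{iky} over (-pi, pi) for integer k."""
    h = 2.0 * math.pi / steps
    return h * sum(cmath.exp(1j * k * (-math.pi + (j + 0.5) * h)) for j in range(steps))

y = 0.7                                   # arbitrary real test value
euler_gap = abs(cmath.exp(1j * y) - complex(math.cos(y), math.sin(y)))

print(euler_gap)                          # difference between e^{iy} and cos y + i sin y
print(abs(cmath.exp(1j * y)))             # absolute value of e^{iy}
print(integral_exp_iky(0), integral_exp_iky(3))
```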


55. The Integral of sin x/x 


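In Chap. 9 we shall want to use the result

$$(1)\qquad \int_{-\infty}^{\infty} \frac{\sin x}{x}\,dx = \pi.$$

This result can be found elsewhere, but the student may be interested in the following argument: If we integrate by parts twice, we have

$$\int_0^{\infty} e^{-tx}\cos ux\,dx = \frac{1}{t} - \frac{u^2}{t^2}\int_0^{\infty} e^{-tx}\cos ux\,dx;$$

and, on solving this equation for the integral, we get

$$\int_0^{\infty} e^{-tx}\cos ux\,dx = \frac{t}{t^2 + u^2}.$$

Now, we integrate with respect to u from 0 to 1 and reverse the order of integration on the left-hand side of the equation. This gives

$$\int_0^{\infty} e^{-tx}\,\frac{\sin x}{x}\,dx = \arctan\frac{1}{t}.$$

Taking the limit under the integral sign can be justified in this case; so we let t → 0 and have

$$\int_0^{\infty} \frac{\sin x}{x}\,dx = \frac{\pi}{2}.$$

Now,

$$\frac{\sin(-x)}{-x} = \frac{\sin x}{x};$$

so

$$\int_{-\infty}^{\infty} \frac{\sin x}{x}\,dx = 2\int_0^{\infty} \frac{\sin x}{x}\,dx = \pi.$$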
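The intermediate formula with the damping factor is easy to test numerically, because the factor e^{−tx} makes the integral converge quickly. The sketch below is an illustration under stated assumptions: the infinite range is cut off at a finite point where the integrand is negligible, and the function name and step count are ours.

```python
import math

def damped_sine_integral(t, upper=60.0, steps=200_000):
    """Midpoint-rule estimate of the integral of e^{-tx} sin(x)/x over (0, upper)."""
    h = upper / steps
    total = 0.0
    for j in range(steps):
        x = (j + 0.5) * h          # midpoints avoid the removable singularity at x = 0
        total += math.exp(-t * x) * math.sin(x) / x
    return total * h

for t in (2.0, 1.0, 0.5):
    print(t, damped_sine_integral(t), math.atan(1.0 / t))
```

As t decreases toward zero the values climb toward π/2, although the undamped integral itself converges too slowly for this simple rule.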


REFERENCES FOR FURTHER STUDY 


Churchill, Introduction to Complex Variables and Applications, New York (1948), Chap. 3.
Titchmarsh, The Theory of Functions, Oxford (1939), pp. 55-58.
de la Vallée Poussin, Cours d'analyse infinitésimale, 1st American ed., New York (1946), Vol. II, Chap. III.
Widder, Advanced Calculus, New York (1947), Chap. XI.


PROBLEMS 


1. Compute the Stirling theorem approximation to 1! Ans. .92.

2. Compute the Stirling theorem approximation to n! for n = 1, 2, . . . , 10. Compare the results with the actual values of n!. Note that the error itself gets larger as n increases, but the relative error gets smaller. This is typical of asymptotic formulas.

3. Use Stirling's theorem to get a numerical answer to Prob. 1, Chap. 3. Hint: Use logarithms. Ans. 4.474 X 107.

4. Use Stirling's theorem to get a numerical answer to Prob. 27, Chap. 4. Ans. 2.52 X 107.


ae Using (5), Sec. 51, and (1), Sec. 52, convert (3), See. 53, into the 


log T(n) = (n — 4) logn — n + C2 + o(1). 


Not 
e that this has been proved only for n an integer. 
6. It can be shown (cf. Titchmarsh, The Theory of Functions, page
58) that

$$\lim_{x \to \infty} \frac{x^{a}\,\Gamma(x)}{\Gamma(x + a)} = 1.$$

Use this result in conjunction with Prob. 5 to show that

$$\log \Gamma(x) = (x - \tfrac{1}{2}) \log x - x + C_2 + o(1).$$

Hint: If x is not an integer, set x = n + a where 0 < a < 1, and
apply the formula given above. Simplify by means of (1), Sec. 52.
7. Substitute the result of Prob. 6 into the Legendre formula (11),
Sec. 5[?]. Show that all terms except the constants cancel, and obtain
the result (not yet proved in the text) that $C_2 = \tfrac{1}{2}\log(2\pi)$.




8. Show that Probs. 5 to 7, in addition to evaluating the constant,
give a Stirling approximation to the Γ function:

$$\Gamma(x) \sim x^{x - \frac{1}{2}}\, e^{-x} \sqrt{2\pi},$$

where x need not be restricted to integer values.

9. Apply (1), Sec. 52, to the formula of Theorem I to get an alternate
form of Stirling's theorem:


n! = (n + V4) mMe--8 From, 


10. By the methods used to prove (1), Sec. 53, show that $r_n$ of Prob.
9 satisfies the inequalities


1 1 
~ Bin-F 1a S < — peas 
11. Apply (9), Sec. 51, to show that

$$B(n - k,\; k + 1) = \frac{1}{(n - k)\,C_{n,k}}.$$
„I. 
12. Use the result of Prob. 11 to complete Probs. 26 and 27, Chap 


Hint: The additive constant can be obtained by substituting a specific
value of x. Let tal


13. The $\chi^2$ distribution with n degrees of freedom is defined by the
density function

$$k_n(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{(n/2)-1} e^{-x/2}$$

for $x > 0$ with $k_n = 0$ for $x \le 0$. Set up the convolution for $k_m$ and
$k_n$.

Ans. $\displaystyle k_m(u) * k_n(u) = \frac{e^{-u/2}}{2^{(m+n)/2}\,\Gamma(m/2)\,\Gamma(n/2)} \int_0^u (u - x)^{(m/2)-1}\, x^{(n/2)-1}\,dx.$

14. Substitute $z = x/u$ in the integral of Prob. 13, and apply (9), Sec. 51, to
show that $k_m * k_n = k_{m+n}$.


15. Starting with Prob. 21, Chap. 5, use Prob. 14 above to give an
inductive proof that, if $x_1, x_2, \ldots, x_n$ are independent and each is
normally distributed with mean 0 and variance 1, then

$$\chi^2 = \sum_{i=1}^{n} x_i^2$$

has a $\chi^2$ distribution with n degrees of freedom.

16. Use (6), Sec. [?], to compute directly $E(\chi^2)$ and $\operatorname{var}(\chi^2)$.
Ans. n, 2n.




17. The density function

$$f(x) = \frac{x^{n-1}(1 - x)^{m-1}}{B(n, m)}$$

defines the beta distribution with parameters n and m. Given that
x has this distribution, find E(x) and var(x).
Ans. $n/(n + m)$, $nm/[(n + m)^2 (n + m + 1)]$.

18. Show from the definition of o( ) that

$$n \cdot o(1) = o(n),$$
$$o(1)/n = o(1/n),$$
$$[o(1/n)]^2 = o(n^{-2}),$$
$$[O(1/n) + o(1/n)]^2 = O(n^{-2}) = o(1/n).$$
19. Show that if f(x) has a continuous derivative in some neighborhood
of 0 and if $f(0) = f'(0) = 0$, then $f(x) = o(x)$ as $x \to 0$. In
particular, under these conditions $f(a/n) = o(1/n)$ as $n \to \infty$. Hint:
L'Hôpital's rule: $\lim f/g = \lim f'/g'$.

20. Generalize Prob. 19. If f(x) has a continuous nth derivative in
some neighborhood of 0 and if

$$f(0) = f'(0) = f''(0) = \cdots = f^{(n)}(0) = 0,$$

then $f(x) = o(x^n)$ as $x \to 0$.

21. Show that, if f(x) has a continuous nth derivative in some neighborhood
of 0, then

$$f(x) = f(0) + f'(0)x + \cdots + \frac{f^{(n)}(0)}{n!}\,x^n + o(x^n).$$

Hint: Transpose, and apply Prob. 20.

22. Compare the result of Prob. 21 with the Maclaurin formula with
remainder:

$$f(x) = f(0) + f'(0)x + \cdots + \frac{f^{(n)}(0)}{n!}\,x^n + \frac{f^{(n+1)}(\theta x)}{(n+1)!}\,x^{n+1}.$$

Note that this latter result requires the existence of one more derivative
but yields the stronger result, remainder $= O(x^{n+1})$.

23. Show from (1), Sec. 52, that $\lim_{n \to \infty}[1 + (a/n)]^n = e^a$.


24. Show that $n \log[1 + (a/n) + o(1/n)] = a + o(1)$. Hint: Let
$z = (a/n) + o(1/n)$, and apply Prob. 21 to $\log(1 + z)$.

26. From Probs. 21 and 25 show that, if $f(0) = 1$ and $f'(x)$ is continuous
in some neighborhood of 0 with $f'(0) = a$, then, for each fixed
x,

$$\lim_{n \to \infty} [f(x/n)]^n = e^{ax}.$$




27. Show that $\cos y = \tfrac{1}{2}(e^{iy} + e^{-iy})$.

28. Show that

$$\sin my \sin ny = -\tfrac{1}{4}\left[e^{i(m+n)y} + e^{-i(m+n)y} - e^{i(m-n)y} - e^{i(n-m)y}\right].$$

29. Using Prob. 28 and (6), Sec. 54, verify that, if m and n are
unequal integers,

$$\int_{-\pi}^{\pi} \sin my \sin ny\,dy = 0.$$

30. Show that

$$\int_{-a}^{a} e^{ity}\,dy = \frac{2}{t}\sin at.$$


CHAPTER 8
LIMIT THEOREMS


One of the most important accomplishments of the mathematicians
who have worked with probability theory has been the formulation and
proof of the so-called limit theorems. An exhaustive study of this
body of work is beyond the scope of an elementary course, but we shall
do what we can without resorting to advanced analytical concepts.
Probability limit theorems are concerned with infinite sequences of
stochastic variables, and in general they take the form, "If the variables
in a sequence satisfy certain conditions, then such and such a law
applies to that sequence." A given law by itself does not
constitute a theorem, but it describes a class of limit theorems: those
which state conditions under which the given law holds. The purpose
of this chapter is to give a classification of the limit theorems by
describing the more important laws that have been studied. Then
Chaps. 9, 10, and 11 take up three of these classifications for a more
detailed study.


56. The Central Limit Law


Let $x_1, x_2, x_3, \ldots$ be an infinite sequence of stochastic variables,
and suppose that for each n there is a joint density function for the
variables $x_1, x_2, \ldots, x_n$. This joint density function determines a
density function for the variable

$$X_n = \sum_{i=1}^{n} (x_i - \bar{x}_i).$$

The question then arises as to the limiting form (as $n \to \infty$) of the
density function for $X_n$. Now (Theorem V, Chap. 6), if the x's are
independent,

$$\operatorname{var}(X_n) = \sum_{i=1}^{n} \operatorname{var}(x_i);$$

therefore in all the interesting cases the values of $X_n$ will be too widely
dispersed to make the study of $X_n$ itself very profitable. However,
the variable

$$S_n = \frac{X_n}{[\operatorname{var}(X_n)]^{1/2}}$$

has a variance equal to unity, and the limit of its density function
should give us a very good picture of the limiting distribution for the
sum of the x's.

The variable $S_n$ is sometimes called the sum in standard units. A
deviation of one unit in $S_n$ corresponds to a deviation of σ (one standard
deviation) in $X_n$.

The density function for $S_n$ itself certainly varies from one sequence
$\{x_i\}$ to another, and at first one might suspect that the limiting density
function would too. It turns out, however, that, for all sequences of
x's satisfying a certain rather general condition, the limiting distribution
for $S_n$ is the normal distribution. This suggests the first of our
important laws.


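Definition 1. The sequence $x_1, x_2, x_3, \ldots$ of stochastic variables
obeys the central limit law provided that, for every a, b with a < b,

$$\lim_{n \to \infty} \operatorname{pr}\{a < S_n \le b\} = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\,dz,$$

where

$$S_n = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_i)}{\left[\operatorname{var}\left(\sum_{i=1}^{n} x_i\right)\right]^{1/2}}.$$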
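
Definition 1 can be illustrated by simulation. The sketch below is ours and only illustrative: it assumes independent variables uniformly distributed on (0, 1), for which each mean is 1/2 and each variance is 1/12, and it compares the observed frequency of $a < S_n \le b$ with the normal integral. The names `normal_cdf` and `simulate_S_n`, the seed, and the sample sizes are all hypothetical choices.

```python
import math
import random

def normal_cdf(z):
    """Standard normal distribution function, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_S_n(n, trials, rng):
    """Draw `trials` values of S_n for x_i uniform on (0, 1):
    each x_i has mean 1/2 and variance 1/12."""
    scale = math.sqrt(n / 12.0)  # [var(X_n)]^(1/2) for independent x_i
    return [(sum(rng.random() for _ in range(n)) - n / 2.0) / scale
            for _ in range(trials)]

rng = random.Random(1951)  # fixed seed so that the run is repeatable
values = simulate_S_n(n=30, trials=20000, rng=rng)

a, b = -1.0, 1.0
observed = sum(a < s <= b for s in values) / len(values)
expected = normal_cdf(b) - normal_cdf(a)
print(observed, expected)
```

A simulation of this kind suggests the law but proves nothing; the conditions under which the limit actually holds are the subject of Chap. 9.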


The history of this important law dates back to the first of the
eighteenth century. Apparently de Moivre (1667-1754) was the first
to discover that the variables describing a Bernoullian sequence of
trials obey the central limit law, though, naturally, he did not state
his result in those terms. Laplace (1749-1827) spent some 20 years on
the general problem of the limiting distribution of $S_n$. He made some
improvements on de Moivre's proof for the Bernoullian case and was
the first to suggest the general theorem that the normal distribution is
the limiting one in a wide variety of cases. Laplace gave a proof of this
general theorem, but it is not considered satisfactory by present-day
standards of rigor. The first really satisfactory proof of Laplace's
theorem was given in 1901 by Liapounoff. For this reason, the
so-called central limit theorem is frequently referred to as the
Laplace-Liapounoff theorem. The title "central limit theorem" seems to have
been originated by Polya in 1920. There have been numerous studies
of the problem made since 1901. One important goal of these has been
to improve on Liapounoff's description of the class of sequences $\{x_i\}$
for which the law holds. A significant improvement of this sort was
given in 1922 by Lindeberg. The true importance of Lindeberg's condition
appeared in 1935, when Feller showed that it is necessary and
sufficient for the law to hold in case the variables $x_i$ are totally
independent. As yet, the theory for dependent variables is very sketchy.
Only a few special cases of this type have been treated.

A physical interpretation of the central limit law would be something
like this: If a physical effect is due to a large number of independent
causes, then regardless of the distributions of the measurements of the
individual causes, if the central limit law applies, we should expect the
measurements of the over-all effect to exhibit a nearly normal distribution.
Chapter 9 is devoted to a discussion of the conditions under
which the central limit law holds, together with a number of applications.


57. The Poisson Distribution 

In a central limit law situation the limiting distribution is a con- 
tinuous one whether the individual stochastic variables are or not. 
The Poisson law furnishes an example of a discrete limiting distribution. 


Definition 2. A sequence $x_1, x_2, x_3, \ldots$ of stochastic variables
having probability functions $f_1, f_2, f_3, \ldots$, respectively, obeys the
Poisson limit law provided that, for each integer $r \ge 0$,

$$\lim_{n \to \infty} f_n(r) = \frac{a^r e^{-a}}{r!}.$$

From a theoretical point of view, the Poisson law seems to apply to
a much narrower field than the central limit law. However, the number
of practical problems that come within its scope gives it a place of
real importance among the limit theorems. One explanation of this
wide applicability lies in the fact that the probability functions

(1) $\displaystyle f_n(r) = C_{n,r} \left(\frac{a}{n}\right)^{r} \left(1 - \frac{a}{n}\right)^{n-r}$

describe variables obeying the Poisson limit law. Now, $f_n(r)$, as
defined above, is just the probability of r successes in n Bernoullian
trials with probability a/n of success on each trial. If we use the
Poisson law to get an approximation for large n and note that n large
makes p = a/n small, then we have the usual Poisson theorem for
rare events: If p is very small, then the probability of r successes in a
large number n of Bernoullian trials is approximately $(np)^r e^{-np}/r!$.
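
The Poisson theorem for rare events is easy to examine numerically. The following Python sketch is ours: the values n = 1000 and p = 0.002 are arbitrary illustrative choices, and the function names are hypothetical. It lists the Bernoullian probabilities beside the Poisson values $(np)^r e^{-np}/r!$.

```python
import math

def bernoulli_prob(n, p, r):
    """Probability of exactly r successes in n Bernoullian trials."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

def poisson_prob(a, r):
    """Poisson probability a**r * exp(-a) / r!."""
    return a**r * math.exp(-a) / math.factorial(r)

n, p = 1000, 0.002  # a large number of trials with a rare event; np = 2
for r in range(6):
    print(r, bernoulli_prob(n, p, r), poisson_prob(n * p, r))

# Largest discrepancy over the values of r that carry nearly all the probability.
max_gap = max(abs(bernoulli_prob(n, p, r) - poisson_prob(n * p, r))
              for r in range(20))
print(max_gap)
```

The table gives no estimate of the error for other n and p; it only shows how close the two columns are in one case of the kind the theorem describes.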

In Chap. 10 we shall give a proof that the Poisson law applies to
the case (1), and that proof will include an estimate of the proper
interpretation of the words "small," "large," and "approximately" in the
above statement. We shall also discuss some specific applications there.

At this stage we should get clearly in mind the distinction between
the Poisson limit theorem and the central limit theorem as applied to
the case of Bernoullian trials. The number of successes in n Bernoullian
trials is the sum of n independent stochastic variables, and it is
easy to show that these obey the central limit law. Therefore, the
normal density function ought to give us approximate information on
probabilities for the number of successes. Where, then, does the
Poisson formula come in? The answer is that, if p is very small, it
gives a better approximation than the central limit law does.
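
To see the contrast in a concrete case, the sketch below (ours, with hypothetical function names and arbitrarily chosen n and p) measures the worst error of each approximation to the Bernoullian probabilities, once for a very small p and once for p = 1/2. The normal approximation is taken here to be the normal density with mean np and variance npq evaluated at r, which is an assumption about how the comparison is to be made. Logarithms are used so that large factorials do not overflow.

```python
import math

def log_bernoulli(n, p, r):
    """Logarithm of the probability of r successes in n Bernoullian trials."""
    log_c = math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
    return log_c + r * math.log(p) + (n - r) * math.log(1 - p)

def poisson_approx(n, p, r):
    """Poisson formula (np)**r * exp(-np) / r!, computed through logarithms."""
    a = n * p
    return math.exp(r * math.log(a) - a - math.lgamma(r + 1))

def normal_approx(n, p, r):
    """Normal density with mean np and variance npq, evaluated at r."""
    var = n * p * (1 - p)
    return math.exp(-((r - n * p) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def worst_errors(n, p):
    """Return (largest Poisson error, largest normal error) over r = 0, ..., n."""
    poisson_err = normal_err = 0.0
    for r in range(n + 1):
        exact = math.exp(log_bernoulli(n, p, r))
        poisson_err = max(poisson_err, abs(poisson_approx(n, p, r) - exact))
        normal_err = max(normal_err, abs(normal_approx(n, p, r) - exact))
    return poisson_err, normal_err

rare = worst_errors(1000, 0.002)   # p very small
common = worst_errors(1000, 0.5)   # p not small
print("p = 0.002:", rare)
print("p = 0.5:  ", common)
```

In the first line of output the Poisson error is the smaller of the two; in the second the roles are reversed.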

As approximations to Bernoullian probabilities, then, these two laws
cover different cases of the same type of situation. Incorporated in
limit theorems, however, they refer to essentially different types of
situations. Given a single infinite Bernoullian sequence, p is constant,
and the limiting distribution is normal in all cases. However, given a
sequence of approximate descriptions of a physical situation each of
which consists of a finite set of Bernoullian trials with p changing from
one set of trials to another so that np is constant, then the limiting
distribution is given by the Poisson formula.


58. The Laws of Large Numbers 


Returning now to a sequence $x_1, x_2, x_3, \ldots$ of variables and its
partial sums

$$X_n = \sum_{i=1}^{n} (x_i - \bar{x}_i),$$

we define the arithmetic means

$$M_n = \frac{X_n}{n}.$$

It is these variables $M_n$ that the laws of large numbers deal with.

Definition 3. A sequence $x_1, x_2, x_3, \ldots$ of stochastic variables
obeys the weak law of large numbers provided that, for every constant
$\varepsilon > 0$,

$$\lim_{n \to \infty} \operatorname{pr}\{|M_n| < \varepsilon\} = 1.$$


The weak law of large numbers applies to a very wide variety of
situations; and where it does apply, it gives some very interesting
information about expectations of stochastic variables. Note that

$$M_n = \frac{1}{n}\sum_{i=1}^{n} x_i - \frac{1}{n}\sum_{i=1}^{n} \bar{x}_i;$$

that is, $M_n$ is the difference between the arithmetic mean of the values
of the $x_i$'s and the arithmetic mean of their expectations. The weak
law of large numbers might, therefore, be restated in this way: Given
any two numbers $\varepsilon > 0$ and $\eta > 0$, there is an integer $N = N(\varepsilon, \eta)$ such
that, if n > N, then the probability is greater than $1 - \eta$ that the
arithmetic mean of the first n variables will differ from the arithmetic
mean of the first n expectations by an amount less in absolute value
than $\varepsilon$.
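
For tosses of a fair coin the probability in the weak law can be computed exactly, without simulation. In the sketch below (ours; the name `prob_mean_within` and the choice $\varepsilon = 1/20$ are hypothetical) r is the number of heads in n tosses, so that $M_n = r/n - 1/2$. Exact fractions are used so that the strict inequality is not disturbed by rounding.

```python
from fractions import Fraction
from math import comb

def prob_mean_within(n, eps):
    """Exact pr{|r/n - 1/2| < eps} for n tosses of a fair coin."""
    half = Fraction(1, 2)
    favorable = sum(comb(n, r) for r in range(n + 1)
                    if abs(Fraction(r, n) - half) < eps)
    return float(Fraction(favorable, 2**n))

eps = Fraction(1, 20)
probs = [prob_mean_within(n, eps) for n in (10, 100, 1000)]
print(probs)
```

The three probabilities increase toward 1, as the law requires for this sequence; for a given $\eta$ one could search in the same way for a suitable N.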

From a strictly theoretical point of view, the weak law of large numbers
is a first cousin to the central limit law. They are each concerned
with the limiting distribution of a normalized sum of n variables from
a given sequence. The central limit law states that the limiting
distribution for the variable $S_n$ is the normal distribution. The weak law
of large numbers states that the limiting distribution for the variables
$M_n$ is the so-called unitary distribution: probability of 0 equal to
unity, and probability of all other values equal to zero.

In discussing the kinship of the central limit law and the weak law of
large numbers, we should point out that the central limit law is the
more potent of the two. By this we do not mean that the central
limit law always implies the weak law of large numbers (see Probs. 14
and 15 at the end of this chapter); but when they both hold, the central
limit law gives far more information. The statement that a sequence
of distributions tends to the unitary merely says that with increasing
probability the values of the variables cluster about zero, but it says
nothing about how the cluster is distributed. For example, each of
the following density functions describes a near unitary distribution:


Fig. 26. 


However, only the first of these shows promise of being anything like
normal after the change of scale $S_n = nM_n/[\operatorname{var}(X_n)]^{1/2}$.

To be more specific, it is easy to show that Bernoullian sequences
obey both laws. Therefore, if we toss a coin a large number of times,
the weak law of large numbers tells us that there is a probability close
to 1 that the number of heads divided by the total number of tosses
will be near 1/2. In terms of distributions, this means that in a large
number of sequences of tosses we should expect most of the values of
r/n to be near 1/2. However, for information as to the expected frequency
distribution of these values of r/n, we must turn to the central
limit law.

While these comments fit the weak law of large numbers into its
proper niche in an outline of the theoretical content of the limit
theorems, there is another way of looking at it that accomplishes two
important ends: (1) gives a picture of the practical interpretation of the
law, and (2) suggests the background for a generalization to the so-called
strong law of large numbers.

The n variables $x_1, x_2, \ldots, x_n$ define an n-dimensional event space
in which each point is represented by a sequence of n numbers (its
coordinates), each of which is a possible value of one of the variables
$x_i$. Now, the inequality $|M_n| < \varepsilon$ defines a point set $A_{n,\varepsilon}$ in this
n-dimensional space: the set of all sequences whose arithmetic means
differ from the number

$$\frac{1}{n}\sum_{i=1}^{n} \bar{x}_i$$

by an amount less in absolute value than $\varepsilon$. The point set $A_{n,\varepsilon}$ has a
certain probability determined by the joint density (or probability)
function for the n-dimensional space. The weak law of large numbers
says that for every $\eta > 0$, if we look at the set $A_{n,\varepsilon}$ in a space of a
sufficiently large number of dimensions, we shall find its probability to
be greater than $1 - \eta$.
Suppose the variables $x_i$ represent the results of experiments. The
n-dimensional space mentioned above represents the set of all possible
sequences of results of the first n experiments. The set $A_{n,\varepsilon}$ represents
all those sequences of results which satisfy the condition $|M_n| < \varepsilon$. If
the weak law of large numbers holds, it tells us that, if n is large enough,
the sequences of results of the $A_{n,\varepsilon}$ type will form a large proportion of
the set of all possible sequences of results.

An event space composed of sequences of numbers suggests the
background for a description of the strong law of
large numbers. Suppose we consider an event space whose points are
infinite sequences of numbers. In particular, we want to consider the
space of all such sequences $a_1, a_2, a_3, \ldots$, where, for each i, $a_i$ is a
possible value of the stochastic variable $x_i$. The details need not concern
us here, but it is possible to define a probability distribution in this
space of infinite sequences in terms of the joint distributions for the
corresponding spaces of finite sequences. If this is done, then a set of
infinite sequences will have a probability. The strong law of large
numbers is concerned with the set of infinite sequences $\{a_i\}$ for which

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} (a_i - \bar{x}_i) = 0.$$
i=1 


Definition 4. The sequence $x_1, x_2, x_3, \ldots$ of stochastic variables
obeys the strong law of large numbers provided that

$$\operatorname{pr}\{\lim_{n \to \infty} M_n = 0\} = 1,$$

where

$$M_n = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}_i).$$


The first thing we should observe about the strong law of large numbers
is that it says $M_n \to 0$ with probability 1, not $M_n \to 0$, period.
The student is already familiar with the fact that probability 1 does
not necessarily mean logical certainty, and (except for certain trivial
cases) probability 1 in the strong law of large numbers definitely does
not mean certainty. Suppose (as is usually the case) there is some fixed
number $\delta > 0$ such that each of the variables $x_i$ can assume a value
greater than $\bar{x}_i + \delta$. Then, it is logically possible that every $x_i$
assumes such a value, in which case $M_n > \delta$ for each n, and $\lim M_n \ge \delta$.
For example, it is theoretically possible that we get heads on every toss
of a coin; therefore it is not logically certain that the ratio of heads to
total tosses will tend to 1/2, though the strong law of large numbers tells
us that the probability of this latter event is unity.
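
The strong law concerns the behavior of a single infinite sequence, and one finite stretch of such a sequence can be followed on a machine. The sketch below is ours and purely illustrative: it generates one sequence of fair-coin tosses with a fixed, arbitrarily chosen seed, forms $M_n$ (heads so far divided by n, minus 1/2), and reports the largest $|M_n|$ over the later part of the run. The name `running_means` is hypothetical.

```python
import random

def running_means(n_tosses, rng):
    """Return M_1, ..., M_n for one sequence of fair-coin tosses (1 = heads)."""
    heads = 0
    means = []
    for n in range(1, n_tosses + 1):
        heads += rng.randint(0, 1)
        means.append(heads / n - 0.5)  # M_n = (heads so far)/n - 1/2
    return means

rng = random.Random(58)  # fixed seed so that the run is repeatable
means = running_means(20000, rng)

# Largest |M_n| for 5000 <= n <= 20000: the tail of this one sequence.
tail_sup = max(abs(m) for m in means[4999:])
print(tail_sup)
```

A finite run can only suggest that $M_n$ becomes and remains small; the statement of the law is about the whole infinite sequence, and the all-heads sequence remains logically possible.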
In a discrete event space this phenomenon of probability zero for a
logically possible event does not occur; so in the strong law of large
numbers we must be dealing with a continuous event space. This is,
indeed, the case; but the student must not jump to the conclusion that
we are restricted to continuous variables for the $x_i$'s.

The example of coin tossing points this out very effectively. A
sequence of tosses of a coin may be represented by a sequence of 0's and
1's with the conditions that the jth and kth places in the sequence are
independent if $j \ne k$ and for each place the probability of 0 is 1/2 and so
is that of 1. Now, an infinite sequence of 0's and 1's can be thought
of as the dyadic expansion of a number between 0 and 1. A dyadic
expansion is an expansion in powers of 1/2 in contrast to a decimal
expansion, which is an expansion in powers of 1/10. For example, in the
decimal system

$$.1047 = \frac{1}{10} + \frac{0}{10^2} + \frac{4}{10^3} + \frac{7}{10^4};$$

in the dyadic system

$$.1011 = \frac{1}{2} + \frac{0}{2^2} + \frac{1}{2^3} + \frac{1}{2^4}.$$


Clearly, a dyadic expansion involves only zeros and ones. The dyadic
.02 would mean (if anything) .10, just as the decimal .0(10) would mean
.10.
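
The passage between a number and its dyadic expansion is mechanical: doubling the number moves the next dyadic place in front of the point. The following sketch is ours; the function names are hypothetical, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

def dyadic_digits(t, places):
    """First `places` digits of the dyadic expansion of t, where 0 <= t < 1."""
    digits = []
    for _ in range(places):
        t *= 2
        digit = int(t)  # 1 if the doubled value has reached 1, otherwise 0
        digits.append(digit)
        t -= digit
    return digits

def from_dyadic(digits):
    """The number whose dyadic expansion begins with the given digits."""
    return sum(Fraction(d, 2 ** (k + 1)) for k, d in enumerate(digits))

print(dyadic_digits(Fraction(11, 16), 4))   # [1, 0, 1, 1]
print(from_dyadic([1, 0, 1, 1]))            # 11/16
```

Read as coin tosses with 1 for heads, the four digits of 11/16 stand for heads, tails, heads, heads.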

So let us think of our infinite sequences of heads and tails as dyadic
expansions of numbers between 0 and 1. In this simple case, the
mysterious "space of infinite sequences" involved in the strong law of
large numbers can be represented as merely the unit interval. Not
only this, but the probability distribution is a familiar one. If a point
is chosen at random in the unit interval, the probability that the kth
place in its dyadic expansion will be 0 is exactly 1/2; and the jth and kth
places are independent for $j \ne k$. (See Prob. 33, Chap. 3, where the
analogous property of decimal expansions is pointed out.) So the
event space of infinite sequences of coin tosses is represented by the
unit interval with a constant density function. Here each variable $x_i$
assumes only two values, but the space of infinite sequences is a familiar
example of a continuous event space.

For the purpose of contrasting the weak and strong laws of large
numbers, let us assume that our sequence space can be represented by
the unit interval with constant density function. We need not stick to
dyadic expansions, though. The general picture is that we have
stochastic variables $M_n$, that is, functions $M_n(t)$ defined over the
event space $0 < t < 1$. The connection between this and our original
description of the limit laws is that $M_n(t)$ is the average deviation from
the mean of the first n values in the particular sequence represented by
the point t. For this model the weak law says that for each fixed
$\varepsilon > 0$ the t set for which $|M_n(t)| \ge \varepsilon$ has probability tending to zero.
The strong law says that the t set for which $M_n(t) \to 0$ has probability 1.

The thing to note about the weak law is that the t set whose probability
is tending to zero may move around in such a way that $M_n(t)$ does
not tend to zero for any fixed t. Suppose

$$M_n(t) = \begin{cases} 1 & \text{for } t \text{ in } E_n \\ 0 & \text{otherwise} \end{cases}$$

where $E_n$ is an interval of length 1/n with the $E_n$'s laid out end to end
and lapping back when they reach the end of the interval.

In this case,

$$\operatorname{pr}\{M_n(t) \ne 0\} \le \frac{1}{n} \to 0.$$

However, $\sum 1/n$ is a divergent series; so these intervals continue to
lap back and forth indefinitely so that for no t does $M_n(t)$ become and
remain close to 0. Thus,

$$\operatorname{pr}\{M_n(t) \to 0\} = 0.$$

So the weak law holds, but the strong law does not.
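
The moving intervals of this example can be followed numerically. In the sketch below (ours; the name `hits` and the particular point t = 0.3 are hypothetical choices) the interval $E_n$ of length 1/n begins where $E_{n-1}$ ends, all positions being reduced modulo 1 so that the intervals lap back to the beginning of the unit interval. The function lists those n for which the chosen t lies in $E_n$, that is, for which $M_n(t) = 1$.

```python
def hits(t, n_max):
    """Indices n <= n_max for which t lies in E_n, so that M_n(t) = 1."""
    found = []
    start = 0.0  # left end of E_n, reduced modulo 1
    for n in range(1, n_max + 1):
        length = 1.0 / n
        # Distance from the left end of E_n around to t, measured modulo 1.
        if (t - start) % 1.0 < length:
            found.append(n)
        start = (start + length) % 1.0  # E_(n+1) begins where E_n ends
    return found

found = hits(0.3, 5000)
print(found)
```

The indices printed become rapidly sparser, since the lengths 1/n shrink, but they do not stop: each complete lap of the intervals covers t once more, and the divergence of the series guarantees further laps. The run is of course finite and shows only the first few returns.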
On the other hand, the strong law implies the weak law. Referring
again to the unit interval model, let $\varepsilon > 0$ be given, and let $B_{k,\varepsilon}$ be the
set of points t such that $|M_n(t)| < \varepsilon$ for $n \ge k$ but not for $n = k - 1$.
Now, $M_n(t) \to 0$ means that t must be in a $B_{k,\varepsilon}$ set for some k; so if the
strong law holds,

$$\operatorname{pr}\left\{\bigcup_{k=1}^{\infty} B_{k,\varepsilon}\right\} = 1.$$

However, we have defined the $B_{k,\varepsilon}$'s so that they are disjoint; thus,
according to the addition principle,

$$\sum_{k=1}^{\infty} \operatorname{pr}\{B_{k,\varepsilon}\} = 1.$$

Therefore,

$$\lim_{n \to \infty} \operatorname{pr}\left\{\bigcup_{k=1}^{n} B_{k,\varepsilon}\right\} = \lim_{n \to \infty} \sum_{k=1}^{n} \operatorname{pr}\{B_{k,\varepsilon}\} = 1.$$

Now, if t is in the set

$$\bigcup_{k=1}^{n} B_{k,\varepsilon},$$

it follows from the definition of the B's that $|M_n(t)| < \varepsilon$; so the result
above tells us that

$$\lim_{n \to \infty} \operatorname{pr}\{|M_n(t)| < \varepsilon\} = 1,$$

and this is the weak law.

We suggested that the student follow the above proof with the unit
interval model in mind with the idea that that might help him to get a
picture of what was going on. We should remark in conclusion, however,
that this proof can be completely divorced from the specific model
to constitute an argument that the strong law of large numbers always
implies the weak law.


59. The Law of the Iterated Logarithm 


If a sequence of variables obeys the strong law of large numbers,
then, with probability 1, $X_n = o(n)$. The question then arises as to a
better estimate of the probable order of magnitude of the sums $X_n$. It
turns out that quite a large number of sequences of stochastic variables
satisfy the condition $X_n = O([\operatorname{var}(X_n) \log\log \operatorname{var}(X_n)]^{1/2})$ with
probability 1 and that for most interesting cases this is the best such
relationship that can be obtained with probability 1. Specifically,
it has been found that many sequences of stochastic variables obey the
following law:


Definition 5. A sequence $x_1, x_2, x_3, \ldots$ of stochastic variables
obeys the law of the iterated logarithm provided that the following
conditions hold with probability 1: For every $\varepsilon > 0$,

(1) $\displaystyle \frac{|X_n|}{[2 \operatorname{var}(X_n) \log\log \operatorname{var}(X_n)]^{1/2}} < 1 + \varepsilon$

for all but a finite number of values of n, and

(2) $\displaystyle \frac{|X_n|}{[2 \operatorname{var}(X_n) \log\log \operatorname{var}(X_n)]^{1/2}} > 1 - \varepsilon$

for an infinite number of values of n.


We shall not attempt to give any detailed discussion of the law of the
iterated logarithm in this book. Theorems giving conditions for the
validity of the other limit laws are given in the next three chapters.
Since there is no chapter on the iterated logarithm law, we might
state here (without proof) the standard theorem:


Theorem I (Kolmogoroff). If the stochastic variables $x_1, x_2, x_3, \ldots$
are individually bounded by the numbers $K_n$ (that is, $\operatorname{pr}\{|x_n| > K_n\} = 0$
for each n) and if

$$K_n = o\left([\operatorname{var}(X_n)]^{1/2}\,[\log\log \operatorname{var}(X_n)]^{-1/2}\right),$$

then the sequence $\{x_n\}$ obeys the law of the iterated logarithm.


There is an example, due to Marcinkiewicz and Zygmund, to the
effect that if the o in Theorem I is replaced by O, the result no longer
holds; but the surprising thing is that in this example it is (2) that fails,
not (1) as one might suspect. In other words, if the values of the
individual variables $x_i$ are too big, the sums $X_n$ may be too small!

There are many sequences of stochastic variables to which both the
central limit law and the law of the iterated logarithm apply, and the
contrasting import of these two laws gives a very good picture of the
notion of random sequences of numbers. Suppose we consider an
infinite sequence of experiments to which both laws apply. The central
limit law says that the possible sequences of results are so arranged
that, for every sufficiently large n, roughly two-thirds of them satisfy
the condition $|S_n| < 1$ and roughly 95 per cent of them satisfy the
condition $|S_n| < 2$. However, the law of the iterated logarithm says that
for almost every individual sequence we have

$$|S_n| > (1 - \varepsilon)[2 \log\log \operatorname{var}(X_n)]^{1/2}$$

an infinite number of times.
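
The two figures quoted from the central limit law, and the size of the bound in the law of the iterated logarithm, can be computed directly. The sketch below is ours, with hypothetical function names; it evaluates the normal probabilities of $|S_n| < 1$ and $|S_n| < 2$ and then tabulates $[2 \log\log \operatorname{var}(X_n)]^{1/2}$ for a few values of the variance, chosen here as powers of 10 purely for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lil_envelope(variance):
    """[2 log log var(X_n)]^(1/2): the iterated-logarithm bound in standard units.
    The variance must exceed e so that log log is positive."""
    return math.sqrt(2.0 * math.log(math.log(variance)))

within_one = 2 * normal_cdf(1.0) - 1   # limiting probability of |S_n| < 1
within_two = 2 * normal_cdf(2.0) - 1   # limiting probability of |S_n| < 2
print(within_one, within_two)

for exponent in (2, 4, 6, 9, 12):
    print(exponent, lil_envelope(10.0 ** exponent))
```

The bound grows extremely slowly, but it does exceed 2 for large variances; so a value of $|S_n|$ that the central limit law makes rare at each single n is nevertheless reached again and again along almost every individual sequence.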


REFERENCES FOR FURTHER STUDY 
Feller, “The Fundamental Limit Theorems in Probability,” Bull. 
Amer. Math. Soc., vol. 51, pp. 800-832, 1945. 
Halmos, “The Foundations of Probability.” 


PROBLEMS 


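1. Show that, for a Bernoullian sequence of trials,

$$M_n = \frac{r}{n} - p$$

and

$$S_n = \frac{r - np}{\sqrt{npq}}.$$

2. State the weak and strong laws of large numbers for Bernoullian
sequences of trials.

Ans. Weak law: Given $\varepsilon > 0$, $\operatorname{pr}\left\{\left|\frac{r}{n} - p\right| < \varepsilon\right\} \to 1$.

Strong law: $\operatorname{pr}\left\{\frac{r}{n} \to p\right\} = 1$.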


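3. Find the probability function for $S_n$ in the case of a Bernoullian
sequence, and state the central limit law in terms of this probability
function.

Ans. $\displaystyle \lim_{n \to \infty} \sqrt{npq}\; C_{n,\,np + z\sqrt{npq}}\; p^{\,np + z\sqrt{npq}}\; q^{\,nq - z\sqrt{npq}} = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}.$

4. Find $M_n$ and $S_n$ for a Poisson sequence of trials (Prob. 12,
Chap. 6).

Ans. $\displaystyle M_n = \frac{r}{n} - \bar{p}, \qquad S_n = \frac{r - n\bar{p}}{\left(\sum_{i=1}^{n} p_i q_i\right)^{1/2}}.$

5. State the central limit law and the weak and strong laws of large
numbers for a Poisson sequence of trials.

Ans. Central limit law: $\displaystyle \lim_{n \to \infty} \operatorname{pr}\left\{a < \frac{r - n\bar{p}}{\left(\sum p_i q_i\right)^{1/2}} \le b\right\} = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\,dz.$

Weak law: Given $\varepsilon > 0$, $\operatorname{pr}\left\{\left|\frac{r}{n} - \bar{p}\right| < \varepsilon\right\} \to 1$.

Strong law: $\operatorname{pr}\left\{\left(\frac{r}{n} - \bar{p}\right) \to 0\right\} = 1$.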


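6. Let each of the variables $x_1, x_2, x_3, \ldots$ have a Poisson distribution
(probability function $a^{x_i} e^{-a}/x_i!$), and let them be independent.
Setting

$$X_n = \sum_{i=1}^{n} x_i,$$

show that

$$M_n = \frac{X_n}{n} - a$$

and

$$S_n = \frac{X_n - na}{\sqrt{na}}.$$

7. State the weak and strong laws of large numbers for the sequences
of variables in Prob. 6.

Ans. Weak law: Given $\varepsilon > 0$, $\operatorname{pr}\left\{\left|\frac{X_n}{n} - a\right| < \varepsilon\right\} \to 1$. Strong
law: $\operatorname{pr}\left\{\frac{X_n}{n} \to a\right\} = 1$.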


8. Find the probability function for $S_n$ in Prob. 6, and state the
central limit law in terms of this probability function. Hint: See
Prob. [?], Chap. [?].

Ans. $\displaystyle \lim_{n \to \infty} \sqrt{na}\;\frac{(na)^{\,na + z\sqrt{na}}\, e^{-na}}{(na + z\sqrt{na})!} = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}.$

9. Show that, if $x_1, x_2, x_3, \ldots$ are independent and each is
normally distributed, then the central limit law holds. Hint: See
Corollary 1 to Theorem X, Chap. 5.

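10. Let $x_1, x_2, x_3, \ldots$ be independent and each normally distributed
with variance unity. Setting

$$\chi^2 = \sum_{i=1}^{n} x_i^2,$$

find $M_n$ and $S_n$ for the sequence $\{x_i^2\}$.

Ans. $\displaystyle M_n = \frac{\chi^2}{n} - 1, \qquad S_n = \frac{\chi^2 - n}{\sqrt{2n}}.$

11. State the strong and weak laws of large numbers for the squares
of independent normally distributed variables.

Ans. Weak law: Given $\varepsilon > 0$, $\operatorname{pr}\left\{\left|\frac{\chi^2}{n} - 1\right| < \varepsilon\right\} \to 1$. Strong
law: $\operatorname{pr}\left\{\frac{\chi^2}{n} \to 1\right\} = 1$.

12. In terms of probabilities for $\chi^2$, state the central limit law for
the squares of independent normally distributed variables.

Ans. $\displaystyle \lim_{n \to \infty} \operatorname{pr}\left\{a < \frac{\chi^2 - n}{\sqrt{2n}} \le b\right\} = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\,dz.$

13. Find the density function for $S_n$ in Prob. 10, and state the central
limit law for the squares of independent normally distributed variables
in terms of this density function. Hint: See Prob. 13, Chap. 7.

Ans. $\displaystyle \lim_{n \to \infty} \frac{\sqrt{2n}}{2^{n/2}\,\Gamma(n/2)}\,(n + z\sqrt{2n})^{(n/2)-1}\, e^{-(n + z\sqrt{2n})/2} = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}.$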
14. Let the stochastic variable $x_k$ assume the values $\pm\sqrt{2k - 1}$
with probability 1/2 each. Let the $x_k$'s be independent. Show that in
this case $S_n = M_n$; so the central limit law and the law of large numbers
cannot both hold. Actually, the central limit law holds, and the
law of large numbers does not. (See Prob. 3, Chap. 9.)

15. Show that, if $\operatorname{var}(X_n) = o(n^2)$, then the central limit law implies
the weak law of large numbers. Hint: In this case $M_n = S_n \cdot o(1)$,


n 1 E ETEA 
—e/0(1) Vin 




16. Does the strong law of large numbers state that in a sequence of 
Bernoullian trials the number of successes minus the expected number 
will very likely tend to zero? 

Ans. No; compare the inequalities $|(r/n) - p| < \varepsilon$ and $|r - np| < \varepsilon$.

17. Assuming the central limit law to hold for Bernoullian trials, 
show that, no matter how large the constant K may be, 


$$\operatorname{pr}\{|r - np| < K\} \to 0.$$


18. Show from Prob. 17 that a gambler with a finite amount of 
capital who plays a perfectly equitable game has a probability of going 
broke arbitrarily close to 1 if only he plays long enough. 


CHAPTER 9 
THE CENTRAL LIMIT THEOREM 


As we pointed out in Chap. 8, the most general form of the central
limit theorem for independent variables is that given by Lindeberg.
We shall begin our discussion by stating the Lindeberg theorem without
proof. This should at least satisfy the student's curiosity as to
what the best possible condition for the validity of the central limit
law looks like. While we have the Lindeberg condition before us,
we shall derive from it certain corollaries describing some of the more
useful cases of the central limit theorem. The chief virtue of the
Lindeberg condition is generality, not simplicity; so the corollaries
may well be of more value than the theorem.

Turning to the question of proofs, we want to outline the method of
characteristic functions. This was Liapounoff's method of attack, and
it remains the favorite tool of the statisticians in dealing with limit
theorems generally. While a rigorous development of this method is a
little too advanced for an elementary course, the formal work can be
carried through; and from this part of our discussion the student should
get an intuitive idea as to why the normal distribution appears as the
limiting distribution for $S_n$ in such a wide variety of cases. This is,
perhaps, the most important part of the chapter, because without these
sections the student has nothing but our word for it that the central
limit law applies to anything except Bernoullian sequences.

We shall conclude with a detailed discussion of the Bernoullian case.
Using essentially the method of de Moivre, we can give a complete
proof without getting above the analytical level on which we are trying
to pitch this book. It might seem that we include this proof only
because we think that there ought to be a proof of something in this
chapter; but, actually, there are two concrete results of this effort.
In this case we are able to give an understandable estimate of the
error term, and in the course of our proof we run across a very clear
explanation of the need for the Poisson distribution as an alternate
approximation formula in case p is small.


80. Notation 


In Chap. 8 we saw that the central limit law was concerned with 
sums of variables of the form $x_k - \bar{x}_k$. To simplify the notation, let us 
make it a standard practice to consider these translated variables. 
Their distinguishing characteristic is that each has mean zero, but 
[see (b) of Theorem IV, Chap. 6] variances are not affected by this 
maneuver. As long as the variables $x_k$ all have expectations, these 
translations can always be performed; so the hypothesis $\bar{x}_k = 0$, which 
appears in all our statements of the central limit theorem, is not 
restrictive, only explanatory. 

To avoid repetition, let us explain once and for all the principal 
symbols to be used throughout this chapter. 


$x_k$: One of a sequence of stochastic variables. As noted above, we 
shall always assume $\bar{x}_k = 0$. 

$f_k$: The density function for $x_k$. 

$X_n$: $\sum_{k=1}^{n} x_k$ 

$b_k$: $\operatorname{var}(x_k)$ 

$B_n$: $\operatorname{var}(X_n)$ 

$S_n$: $X_n/\sqrt{B_n}$ 


The classical central limit law (Definition 1, Chap. 8) cannot even be 
formulated unless $B_n$ exists. Therefore, we shall assume without 
stopping to say so in each theorem that the second moments of the 
variables always exist. It has been found profitable (even in cases 
where $B_n$ exists) to generalize this definition by dividing the sums $X_n$ 
by something other than $\sqrt{B_n}$. However, for an introductory course 
we have decided to confine ourselves to the Laplacian form of the 
central limit law given in Chap. 8. 

There is one trivial case of the central limit theorem that we should 
dispose of before beginning a general discussion. If all the variables $x_k$ 
are normally distributed, then by Corollary 1 to Theorem X, Chap. 5, 
each of the variables $S_n$ has a normal distribution with variance unity; 
so the central limit law obviously applies. Having noted this special 
case, let us now agree to exclude it from consideration in all further 
discussions. This will simplify the statement of theorems in several 
instances. 


61. The Lindeberg Version—Informal Discussion 


As we pointed out in Chap. 8, Lindeberg gave a very interesting 
condition for the validity of the central limit law, and Feller showed 
some time later that this condition is the best possible for the case of 
independent variables. We state this result here without proof. 




Theorem I (Lindeberg-Feller). If the stochastic variables $x_k$ are 
totally independent and each has mean zero, a necessary and sufficient 
condition that 

$$\lim_{n\to\infty} \operatorname{pr}\{a < S_n < b\} = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz$$

is that, for every $\epsilon > 0$, 

$$\lim_{n\to\infty}\, \max_{k \le n}\, \frac{1}{b_k} \int_{|z| > \epsilon\sqrt{B_n}} z^2 f_k(z)\, dz = 0. \tag{1}$$

There are a number of interesting observations that can be made 
about the Lindeberg condition (1). 


Corollary 1. If the Lindeberg condition holds, $B_n \to \infty$. 


For independent variables, $B_n$ is the sum of the positive quantities 
$b_k$ and so is monotone-increasing. Therefore, it either tends to $\infty$ or is 
bounded. Suppose there were a constant $M$ such that $B_n < M$ for 
all $n$'s. Then, all we have to do is choose $\epsilon$ so that 

$$\int_{|z| > \epsilon\sqrt{M}} z^2 f_1(z)\, dz > \frac{b_1}{2},$$

and the required maximum will be greater than 1/2 for all $n$'s. This 
clearly contradicts (1). 

The converse to Corollary 1 is not true. If $B_n \to \infty$, then, for each 
fixed $k$, the integral in (1) will tend to zero; but it does not follow that 
the maximum called for in (1) will tend to zero. We shall see an 
example of this in the next section. 

First, however, let us note some important special cases in which the 
condition $B_n \to \infty$ is sufficient for the central limit law to hold. 


Corollary 2. If the variables $x_k$ are independent and uniformly 
bounded (i.e., there is a constant $K$ such that $\operatorname{pr}\{|x_k| > K\} = 0$ for 
every $k$), then the condition $B_n \to \infty$ implies that the sequence $\{x_k\}$ 
obeys the central limit law. 

For each $\epsilon > 0$, $\epsilon\sqrt{B_n} \to \infty$; and as soon as $\epsilon\sqrt{B_n} > K$, the integral 
in (1) is zero for every $k$. Therefore, the Lindeberg condition is 
obviously satisfied. 

Corollary 3. If the variables $x_k$ are independent and identical 
(i.e., all have the same density function) and have second moments, 
then the sequence $\{x_k\}$ obeys the central limit law. 




Since the variables $x_k$ are identical, the variances $b_k$ are all the same. 
Let us call them $b$. Since the variables $x_k$ are independent, we have by 
Theorem V, Chap. 6, 

$$B_n = nb \to \infty.$$

Therefore, as noted above, the integral in (1) tends to zero for each 
fixed $k$. Now, with the variables identical, the maximum called for in 
(1) is given by any term we happen to choose; say, by $k = 1$ 
for every $n$. So, if the individual integrals tend to zero, the maximum 
does and the Lindeberg condition is satisfied. 
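
Corollary 3 can be watched at work in a small experiment. The sketch below is only an illustration under our own assumptions: the choice of a centered exponential variable (mean zero, $b = 1$), the seed, the number of trials, and all function names are ours. It forms $S_n = X_n/\sqrt{nb}$ many times and compares the observed frequency of $a < S_n < b$ with the normal integral.

```python
import math
import random

def simulate_S_n(n, trials, seed=0):
    """Return `trials` simulated values of S_n = X_n / sqrt(B_n).

    Assumption for illustration: each x_k is an exponential variable
    shifted to mean zero, so b_k = 1 and B_n = n.
    """
    rng = random.Random(seed)
    root = math.sqrt(n)
    return [sum(rng.expovariate(1.0) - 1.0 for _ in range(n)) / root
            for _ in range(trials)]

def fraction_between(values, a, b):
    """Observed relative frequency of a < value < b."""
    return sum(1 for v in values if a < v < b) / len(values)

def normal_integral(a, b):
    """Integral of the standard normal density from a to b."""
    return 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))

if __name__ == "__main__":
    values = simulate_S_n(n=50, trials=5000)
    print(fraction_between(values, -1.0, 1.0), normal_integral(-1.0, 1.0))
```

The two printed numbers should be close but not equal; the experiment illustrates the limit statement and proves nothing.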

Despite our comment in the previous section that we are automatically 
assuming that second moments exist, we made special mention 
of this in Corollary 3 to eliminate possible confusion. The oldest 
known example of the failure of the central limit law (discovered by 
Cauchy) involves identical variables whose second moments do not 
exist. 


62. Examples—Failure of the Law 


In any discussion of the central limit law it should be pointed out 
that, while it applies to a wide variety of cases, it is by no means 
universally applicable. In a later section of this chapter we shall 
describe the more profound example of Cauchy. Here, we shall 
give two rather trivial examples in which the Lindeberg condition is not 
satisfied. It then follows from Feller's result that at least the classical 
version of the central limit law fails for these examples. 

The easiest way to construct such an example is to note Corollary 1 
to Theorem I and take a sequence of variables for which $B_n$ is bounded. 
Let $x_k$ be defined as follows. 


m: —lgt g 
fm): Yee 


Here 


F ‘ 3 d 
If the variables are independent, Theorem V, Chap. 6, applies, and 
$B_n = \sum_{k=1}^{n} b_k$ is bounded. 



However, we said in the last section that we may have $B_n \to \infty$ and 
still not have the Lindeberg condition satisfied. An example of this 
situation is given by the variables $x_k$ defined as follows. 

$x_k$:      $-2^k$          $0$            $2^k$ 
$f(x_k)$:   $1/2^{k+1}$     $1 - 1/2^k$    $1/2^{k+1}$ 

Here 

$$b_k = \frac{1}{2^{k+1}}\,(2^k)^2 + \frac{1}{2^{k+1}}\,(2^k)^2 = 2^k;$$

so 

$$B_n = \sum_{k=1}^{n} 2^k = 2^{n+1} - 2.$$

However, the two nonzero values of $x_n$ are both greater in absolute 
value than $\sqrt{B_n}$, and the second moment of those two values is exactly $b_n$. 
So, taking $\epsilon = 1$ in the Lindeberg condition and setting $k = n$ in each 
case, we see that the required maximum is identically 1. 
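
The arithmetic of this second example is easy to check exactly. The following is a minimal sketch, assuming the values $\pm 2^k$ with probabilities $1/2^{k+1}$ as read above; the function names are ours. It computes the ratio appearing in condition (1) with rational arithmetic and confirms that the maximum is 1 for every $n$ tried.

```python
from fractions import Fraction

def b(k):
    """Variance of x_k, which takes the values +-2^k with probability 1/2^(k+1) each."""
    return 2 * Fraction(1, 2 ** (k + 1)) * (2 ** k) ** 2

def B(n):
    """Variance of X_n for independent variables: the sum of b_1, ..., b_n."""
    return sum(b(k) for k in range(1, n + 1))

def lindeberg_ratio(k, n, eps=1):
    """(1/b_k) times the second moment of x_k over the set |x_k| > eps*sqrt(B_n).

    Squares are compared so that no square root has to be taken.
    """
    threshold_sq = Fraction(eps) ** 2 * B(n)
    tail = b(k) if (2 ** k) ** 2 > threshold_sq else Fraction(0)
    return tail / b(k)

def lindeberg_max(n, eps=1):
    """The maximum over k <= n called for in condition (1)."""
    return max(lindeberg_ratio(k, n, eps) for k in range(1, n + 1))

if __name__ == "__main__":
    for n in (1, 5, 20):
        print(n, B(n), lindeberg_max(n))
```

Since the maximum stays at 1 while $B_n$ grows without bound, the sketch agrees with the remark that $B_n \to \infty$ alone does not give the Lindeberg condition.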


63. Normal Distributions in Nature 
There are many examples of physical phenomena for which experimental 
evidence indicates approximately normal distributions. In 
biology, many physical measurements on living organisms, tabulated 
for a large population, exhibit a nearly normal frequency distribution. 
In physics, the results of a series of attempts at the exact measurement 
of a quantity are usually very nearly normally distributed. In ballistics, 
the pattern of hits from a large number of shots fired at a fixed 
target usually forms something that resembles a normal distribution. 
In thermodynamics, the distribution of molecular velocities in an ideal 
gas is very close to normal. 

On seeing these and other examples of normal distributions in nature, 
anyone familiar with the central limit theorem would quite naturally 
suspect that it had something to do with the situation. The usual 
interpretation is that such effects are the result of a large number of 
essentially independent causes and so can be expressed as the sum of a 
large number of independent stochastic variables, each variable 
representing the effect due to a single cause. This "hypothesis of 
elementary causes" is certainly plausible as a possible explanation of the 
occurrence of normal distributions in nature. However, it is hardly 
justifiable as the explanation. There are other hypotheses than those 
of the central limit theorem that lead to a normal distribution. (See, 
for instance, Levy and Roth, Elements of Probability, pages 118 to 124.) 

The hypothesis of elementary causes is an interesting one, and it may 
well furnish the explanation of many of the normal distributions in 
natural phenomena. However, the most important applications of the 
central limit theorem found so far are not those based on 
this idea. Furthermore, in many cases the most convincing evidence 
in favor of normal distributions is purely experimental and has nothing 
to do with the central limit theorem. 

Another thing we should bear in mind is that the mere hypothesis 
that a phenomenon is due to a large number of causes does not 
necessarily lead to the central limit law. If there is a single major cause 
modified by a number of second-order effects, the distribution of the 
over-all effect should be essentially that of the major cause, and this 
may not be normal at all. This is readily explained by Corollary 1 to 
Theorem I. If the second-order effects are definitely just that, $B_n$ 
remains only a little larger than $b_1$, and the central limit theorem does 
not apply. 

Finally, then, we might say a word or two about the use of normal 
distributions in statistics. In the next two sections we shall look at 
two examples in which we can pull our way up via the central limit 
theorem from a sequence of more or less arbitrary distributions to 
something that is normal or based on the normal distribution. These 
are only by way of example. There are many other uses for the central 
limit theorem in statistics. However, there is another body of statistical 
formulas based on the assumption that the population being studied 
is normally distributed. (See, for instance, Uspensky, Introduction to 
Mathematical Probability, Chap. XVI.) We can think of two 
reasons why any study of statistics involves a lot of work with normal 
populations. One is that this is the field that has proved most 
susceptible to investigation, and so there is more known about it. The 
other is the point we are concerned with in this section, that for first 
one reason and then another there are quite a lot of normal 
distributions among the things statisticians study. 

Statistical theories based on normally distributed populations are 
not at all out of order as long as we realize that the normal distribution 
is only something that is frequently encountered, not something 
guaranteed in all situations by the theory of probability. 



64. Example—Sample Means 


Let us turn now to some of the things the central limit theorem can 
actually prove for the statistician. The first thing we note is 
that the process of sampling furnishes an example of a sequence of 
stochastic variables. Suppose there is a population of some sort, for 
which we are interested in a certain measurement connected with the 
individual members. For example, suppose we are interested in the 
height of American school children, the income of American families, 
the butterfat content of milk sold by a certain dairy, the number of 
defective cartridges per case in those manufactured by a certain 
company, etc. If we pick a single item (i.e., school child, family, quart of 
milk, or case of cartridges) at random, there is a probability distribution 
for the various possible results we may get on measuring this sample 
item. That is, the result of an individual sample measurement is a 
stochastic variable. So the usual collection of a number of such sample 
measurements forms a sequence of stochastic variables. 

Let us suppose that each sample measurement is independent of all 
the others. In sampling from a fixed population, this means random 
sampling with replacements, a practice not usually followed. We 
want to use this independence hypothesis here, but this does not 
completely invalidate our results. The degree of dependence of samples 
from a very large population is probably so small that it does not 
introduce appreciable errors. Furthermore, in the examples of sampling 
from a production line (milk, cartridges, etc.) the hypothesis of 
independence seems reasonable anyway. 

The result of each sample measurement is a stochastic variable with 
a probability distribution given by the distribution of measurements 
for the entire population. This distribution is unknown, of course 
(otherwise there would be no point in sampling); but in general it is 
bounded and therefore must have a mean $\bar{x}$ and a variance $\sigma^2$. 

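Let $x_1, x_2, \ldots, x_n$ be the results of the first $n$ sample measurements. 
Each $x_k$ is a stochastic variable with mean $\bar{x}$ and variance $\sigma^2$. Therefore, 

$$E\left(\sum_{k=1}^{n} x_k\right) = n\bar{x};$$

and since the $x_k$'s are independent, 

$$\operatorname{var}\left(\sum_{k=1}^{n} x_k\right) = n\sigma^2.$$

Since these are identical variables (even though their common 
distribution is completely unknown), Corollary 3 to Theorem I applies, and 

$$\lim_{n\to\infty} \operatorname{pr}\left\{a < \frac{\sum_{k=1}^{n} x_k - n\bar{x}}{\sigma\sqrt{n}} < b\right\} = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz.$$

Now, we let 

$$-a = b = \frac{\epsilon\sqrt{n}}{\sigma},$$

and our result reduces to 

$$\operatorname{pr}\left\{\left|\frac{1}{n}\sum_{k=1}^{n} x_k - \bar{x}\right| < \epsilon\right\} \approx \int_{-\epsilon\sqrt{n}/\sigma}^{\epsilon\sqrt{n}/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz$$

for large $n$. 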

The variable $(1/n)\sum x_k$, the arithmetic mean of the sample 
measurements, is called the sample mean. We thus have an estimate of the 
probability that the sample mean differs from the true mean $\bar{x}$ by less 
than any given amount. Since $\epsilon\sqrt{n}/\sigma \to \infty$ as $n \to \infty$, this 
probability tends to 1. 

This last result would follow more directly from the law of large 
numbers. The significant contribution of the central limit theorem is 
that the sample mean is approximately normally distributed with 
expectation $\bar{x}$ and variance $\sigma^2/n$. Furthermore, this is true no matter 
what the distribution is in the population from which we are sampling. 
Thus, if we have some information on the size of $\sigma^2$, we can find from a 
table of the normal distribution the probability that the sample mean 
will differ from the true mean by any given amount. 

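We shall return to this problem in Chap. 11, where we shall show 
how the law of large numbers can be used to justify the usual estimate 
of $\sigma^2$ from the sample data. Just now we might note a very simple case 
in which we can get a lot of information without knowing the exact 
value of $\sigma^2$. Suppose our samples are a Bernoullian sequence of trials. 
That is, the result of each sample is either yes or no. The sample 
mean is then $r/n$, the ratio of the number of yeses to the number of 
samples. The true mean is $p$, and so we have 

$$\operatorname{pr}\left\{\left|\frac{r}{n} - p\right| < \epsilon\right\} \approx \int_{-\epsilon\sqrt{n}/\sigma}^{\epsilon\sqrt{n}/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz.$$

Now, in this case $\sigma^2 = pq \le 1/4$; so 

$$\operatorname{pr}\left\{\left|\frac{r}{n} - p\right| < \epsilon\right\} \gtrsim \int_{-2\epsilon\sqrt{n}}^{2\epsilon\sqrt{n}} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz.$$

If we note, for instance, that according to the tables 

$$\int_{-2}^{2} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz > .95,$$
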

we see that with probability at least .95 we have the experimental ratio 
within $\epsilon$ of the theoretical probability provided that $\epsilon\sqrt{n} > 1$. For 
example, we get an estimate of $p$ correct to within .01 with probability 
.95 provided that $.01\sqrt{n} > 1$, or $n > 10,000$. This is taking the 
maximum value for $\sigma^2$. If there is good reason to believe that $p$ is 
noticeably different from 1/2, we can improve this estimate. 
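
The table lookup in this example can be replaced by a short computation. The sketch below is ours, not part of the original development; it assumes only that the normal integral from $-c$ to $c$ equals $\operatorname{erf}(c/\sqrt{2})$, and the function names are hypothetical. It evaluates the approximate probability that $r/n$ lies within $\epsilon$ of $p$.

```python
import math

def normal_prob_within(c):
    """Integral of the standard normal density from -c to c."""
    return math.erf(c / math.sqrt(2))

def prob_ratio_within(eps, n, p=0.5):
    """Normal approximation to pr{|r/n - p| < eps} for n Bernoullian trials.

    With p = 1/2 the variance pq takes its maximum value 1/4, which is
    the worst case used in the text.
    """
    sigma = math.sqrt(p * (1 - p))
    return normal_prob_within(eps * math.sqrt(n) / sigma)

if __name__ == "__main__":
    print(normal_prob_within(2.0))               # the tabulated value exceeding .95
    print(prob_ratio_within(0.01, 10000))        # worst case, p = 1/2
    print(prob_ratio_within(0.01, 10000, p=0.1)) # improved estimate when p is far from 1/2
```

The last line illustrates the closing remark: when $p$ is known to be far from 1/2, the same sample size gives a noticeably higher probability.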


65. Example—The $\chi^2$ Test 

The $\chi^2$ test of significance for sample data is a standard tool in 
present-day statistical practice. A complete derivation of it is a little 
over our heads in this course, but we have developed enough of the 
basic ideas to give the student some conception of what is involved. 

The probability distribution of a certain population is assumed to 
be known. Then samples are taken from this population. Now, we 
should not expect the frequency distribution of the samples to follow 
the theoretical distribution exactly, but we want to get some measure 
of the deviation of the sample distribution from the theoretical one 
that will give us an idea of the consistency of our theoretical assumption 
with the experimental results. 

If the total range of the sample values is divided into $m$ intervals, 
the theoretical distribution gives us a probability $p_i$ $(i = 1, 2, \ldots, m)$ 
that a sample will fall in each of the intervals. So the expected number 
in the $i$th interval out of a total of $n$ sample values is $np_i$. If $r_i$ is the 
actual number of samples falling in the $i$th interval, the sum of the 
squares of the deviations 

$$\sum_{i=1}^{m} (r_i - np_i)^2$$

constitutes a reasonable measure of the discrepancy between theoretical 
and experimental distributions. 


It turns out that things work better if we use 

$$\chi^2 = \sum_{i=1}^{m} \frac{(r_i - np_i)^2}{np_i}.$$

Now, the problem is to find the probability distribution for $\chi^2$. If we 
let 

$$v_i = \frac{r_i - np_i}{\sqrt{np_i}},$$

we have 

$$\chi^2 = \sum_{i=1}^{m} v_i^2.$$



This reminds us of Prob. 15, Chap. 7, in which we had the sum of 
squares of independent normally distributed variables. Each $v_i$ is just 
a linear function of $r_i$, the number of successes in $n$ independent trials 
with probability $p_i$ of success on each trial; therefore, as $n \to \infty$, the 
individual $v_i$'s are asymptotically normal. However, they are not 
independent. If we remember that $\sum r_i = n$ and $\sum p_i = 1$, we see that 

$$\sum_{i=1}^{m} \sqrt{p_i}\, v_i = \frac{1}{\sqrt{n}} \sum_{i=1}^{m} (r_i - np_i) = 0.$$

That is, $m - 1$ of the $v_i$'s determine the other one completely. 
Furthermore, since this relation is a linear one, it tells us that the $v_i$'s 
lie in an $(m - 1)$-dimensional hyperplane; i.e., their joint distribution 
is $(m - 1)$-dimensional. 
As we noted above, the individual $v_i$'s have distributions that tend 
to the normal. It does not follow that they are asymptotically in normal 
correlation, but it is not surprising that this is the case. This we 
shall not attempt to prove. It can be obtained from a multidimensional 
generalization of the central limit theorem. Now, in Theorem 
X, Chap. 5, we proved that two variables in normal correlation must 
be linear combinations of some pair of independent normally 
distributed variables. This, too, can be generalized to any number of 
dimensions with the result that in their limiting form the variables $v_i$, 
contained in an $(m - 1)$-dimensional space, must be linear combinations 
of $m - 1$ independent normally distributed variables $u_1, u_2, \ldots, u_{m-1}$. 
Again we omit the details, but it turns out that, because of the 
strategic choice of constants we made in defining the $v_i$'s, each of the 
variables $u_i$ has variance unity and 

$$\sum_{i=1}^{m} v_i^2 = \sum_{i=1}^{m-1} u_i^2.$$

So, as $n \to \infty$, the distribution of $\chi^2$ tends to that of the sum of squares 
of $m - 1$ independent normally distributed variables. By Prob. 15, 
Chap. 7, 

$$\lim_{n\to\infty} \operatorname{pr}\{\chi^2 > \chi_0^2\} = \int_{\chi_0^2}^{\infty} \frac{u^{(m-3)/2}\, e^{-u/2}}{2^{(m-1)/2}\, \Gamma[(m-1)/2]}\, du.$$



This $\chi^2$ distribution has been tabulated extensively, and from these 
tables we can find an approximation to the probability that a set of 
samples will deviate (in the sense of this test) by more than a given 
amount from the theoretical frequency. 
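
In place of a table, the limiting formula can be evaluated numerically. The following is a rough sketch under our own assumptions: the midpoint rule, the truncation point of the integral, the step count, and the function names are all ours. It computes the $\chi^2$ statistic for observed counts $r_i$ and then the limiting tail probability for $m$ intervals.

```python
import math

def chi_square(counts, probs):
    """Sum over the intervals of (r_i - n p_i)^2 / (n p_i)."""
    n = sum(counts)
    return sum((r - n * p) ** 2 / (n * p) for r, p in zip(counts, probs))

def chi_square_tail(x0, m, steps=100000):
    """Limiting value of pr{chi^2 > x0} for m intervals (x0 > 0).

    Midpoint-rule integration of u^((m-3)/2) e^(-u/2), divided by
    2^((m-1)/2) Gamma((m-1)/2). The upper limit is a finite cutoff
    chosen so that the neglected part of the integral is negligible.
    """
    df = m - 1
    norm = 2 ** (df / 2) * math.gamma(df / 2)
    upper = x0 + 80 + 10 * df
    h = (upper - x0) / steps
    total = 0.0
    for j in range(steps):
        u = x0 + (j + 0.5) * h
        total += u ** (df / 2 - 1) * math.exp(-u / 2)
    return total * h / norm

if __name__ == "__main__":
    # Hypothetical data: 60 throws of a die, expected number 10 per face.
    counts = [8, 12, 9, 11, 10, 10]
    x = chi_square(counts, [1 / 6] * 6)
    print(x, chi_square_tail(x, m=6))
```

For $m = 3$ the integrand reduces to $\tfrac{1}{2}e^{-u/2}$, so the tail is exactly $e^{-\chi_0^2/2}$; this gives a convenient check on the numerical work.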

The student should bear in mind that this is an asymptotic formula, 
valid only for large $n$; we shall not attempt to work out an estimate of 
how large. Suffice it to say that most authorities agree that it should 
be used only if $n$ is so large that the expected number $np_i$ of samples 
per interval is at least 10 for every interval. The student should note 
that this does not require a large number of intervals. As a matter of 
fact, the fewer intervals there are, the fewer samples it takes to get the 
expected number per interval up to 10; though, of course, the fewer 
intervals we use, the less information we get. Finally, we might note 
that, if the number $m$ of intervals is very large (and the number $n$ of 
samples sufficiently larger), the formula in this section becomes 
unwieldy; and we can use the central limit theorem again, this time 
taking a limit on $m$. For an outline of this procedure, see Prob. 8 at 
the end of this chapter. 


66. Characteristic Functions 
If $f(x)$ is the density function for a stochastic variable $x$, we define 
the characteristic function for $x$ as 

$$\varphi(t) = \int_{-\infty}^{\infty} e^{itx} f(x)\, dx. \tag{1}$$


In the light of Theorem I, Chap. 6, we can say that $\varphi(t)$ is the 
expectation of $e^{itx}$. The characteristic function is the principal tool in the 
Liapounoff proof of the central limit theorem. That proof will be 
outlined in the next section. Here we want to indicate as best we can 
the properties of characteristic functions that will be needed. 

First, we might note that every distribution has a characteristic 
function. The integral defining $\varphi(t)$ has to be absolutely convergent 
because the integral of $f(x)$ is, and $e^{itx}$ is bounded. The second 
important thing is this: 


Theorem II. The characteristic function uniquely determines the 
distribution. 


As a matter of fact, in all cases (discrete, continuous, and otherwise), 
the distribution function $F(u) = \operatorname{pr}\{x < u\}$ is given by the formula 

$$F(u) - F(a) = \lim_{T\to\infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itu}}{it}\, \varphi(t)\, dt \tag{2}$$




whenever $u$ and $a$ are points of continuity of $F$. A rigorous proof of 
Theorem II is beyond the scope of this book, but we might make it 
seem reasonable by going through the formal computations for the 
special cases with which we are familiar. If there is a density function 
$f(x)$, it is the derivative at $u = x$ of $F(u)$. Without attempting to 
justify the procedure, let us take the limit as $T \to \infty$ and the derivative 
with respect to $u$ both under the integral sign in (2). Then, the 
substitution $u = x$ gives 

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx}\, \varphi(t)\, dt.$$


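By continuing to ignore the finer points that need rigorous 
justification, we can give a formal indication that this last formula is correct: 

$$\int_{-k}^{k} e^{-ity}\, \varphi(t)\, dt = \int_{-k}^{k} e^{-ity} \int_{-\infty}^{\infty} e^{itx} f(x)\, dx\, dt$$
$$= \int_{-\infty}^{\infty} \int_{-k}^{k} e^{it(x-y)} f(x)\, dt\, dx$$
$$= \int_{-\infty}^{\infty} f(x) \left[\frac{e^{it(x-y)}}{i(x-y)}\right]_{t=-k}^{t=k} dx$$
$$= \int_{-\infty}^{\infty} \frac{2\sin k(x-y)}{x-y}\, f(x)\, dx.$$

Letting $w = k(x - y)$, we have 

$$\int_{-\infty}^{\infty} e^{-ity}\, \varphi(t)\, dt = \lim_{k\to\infty} \int_{-\infty}^{\infty} \frac{2\sin w}{w}\, f\!\left(y + \frac{w}{k}\right) dw.$$

If we can take the limit under the integral sign (a step that is really 
hard to justify in this particular case) and if $f$ is continuous, the 
application of (1), Sec. 55, gives us 

$$\int_{-\infty}^{\infty} e^{-ity}\, \varphi(t)\, dt = 2\pi f(y).$$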


In the discrete case, the proof is much simpler, but it is not nearly as 
convincing because the reduction of (2) to a formula for the probability 
function is not at all obvious intuitively. The best we can do is to 
state it and indicate the proof. Let us assume that the values of $x$ are 
all integers. Then the probability function, which we shall now 
write $f(n)$, is given by the formula 

$$f(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-int}\, \varphi(t)\, dt.$$




To prove this, we note that 

$$\varphi(t) = \sum_{k} e^{ikt} f(k);$$

so, multiplying by $e^{-int}$ and integrating, we have 

$$\int_{-\pi}^{\pi} e^{-int}\, \varphi(t)\, dt = \int_{-\pi}^{\pi} \sum_{k} e^{i(k-n)t} f(k)\, dt.$$

If the sum is a finite one, there is no argument; but, even for an infinite 
series, it is not hard to justify integrating term by term in this case. 
So we have 

$$\sum_{k} f(k) \int_{-\pi}^{\pi} e^{i(k-n)t}\, dt = 2\pi f(n)$$

because by (6), Sec. 54, all the integrals vanish except the one for 
$k = n$, and it is equal to $2\pi$. 
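
The discrete inversion formula is simple enough to try on a small example. In the sketch below the three-point distribution, the midpoint rule, and the function names are our own choices for illustration. We form $\varphi(t) = \sum_k e^{ikt} f(k)$ and recover $f(n)$ from the integral over $(-\pi, \pi)$.

```python
import cmath
import math

def char_fn_discrete(f, t):
    """Characteristic function of an integer-valued variable.

    `f` maps each integer value k to its probability f(k).
    """
    return sum(prob * cmath.exp(1j * k * t) for k, prob in f.items())

def invert(f, n, steps=2000):
    """(1/2pi) times the integral over (-pi, pi) of e^(-int) phi(t) dt, by the midpoint rule."""
    h = 2 * math.pi / steps
    total = 0j
    for j in range(steps):
        t = -math.pi + (j + 0.5) * h
        total += cmath.exp(-1j * n * t) * char_fn_discrete(f, t)
    return (total * h / (2 * math.pi)).real

if __name__ == "__main__":
    f = {0: 0.2, 1: 0.5, 3: 0.3}  # hypothetical probability function
    for n in range(5):
        print(n, round(invert(f, n), 6))
```

Values of $n$ that carry no probability come back as zero (up to rounding), which is the vanishing of the integrals for $k \ne n$ in numerical form.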

We have gone to all this trouble to make Theorem II seem reasonable 
to the student because, if each characteristic function uniquely deter- 
mines a distribution, then it is not unreasonable to suspect that the 
limit of a sequence of characteristic functions would determine a dis- 
tribution function which is the limit of the corresponding sequence of 
distribution functions. This is, indeed, the case, but we shall not even 
try to prove it here. We merely list for reference the following: 


Theorem III. If a sequence $\varphi_n$ of characteristic functions converges 
to a characteristic function $\varphi$, then the corresponding sequence of 
distribution functions converges to the distribution function determined 
by $\varphi$. 


If we accept Theorem III, we are well on the road to proving the 
central limit theorem. All we have to do is show that the characteristic 
function for $S_n$ converges to the characteristic function for the 
normal distribution. Now, $S_n$ is the sum of the first $n$ $x$'s divided by 
$\sqrt{B_n}$; so we have three more preliminary questions: (1) What does 
the characteristic function of a sum look like? (2) What happens to 
the characteristic function when we divide a variable by a constant? 
(3) What is the characteristic function of the normal distribution? 


Theorem IV. If $x$ and $y$ are independent and have characteristic 
functions $\varphi_1$ and $\varphi_2$, respectively, then $\varphi_1\varphi_2$ is the characteristic function 
for $x + y$. 



By Theorem V, Chap. 5, $e^{itx}$ and $e^{ity}$ are independent. Now, 
$e^{it(x+y)} = e^{itx} e^{ity}$; so the characteristic function for $x + y$ is just the 
expectation of a product of independent variables, and the result 
follows from Theorem III, Chap. 6. 


Theorem V. If $\varphi(t)$ is the characteristic function for $x$, then $\varphi(t/k)$ 
is the characteristic function for $x/k$. 


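By Theorem VIII, Chap. 5, the density function for $u = x/k$ is 
$g(u) = k f(ku)$; so the characteristic function for $u$ is 

$$\int_{-\infty}^{\infty} e^{itu}\, k f(ku)\, du = \int_{-\infty}^{\infty} e^{i(t/k)(ku)} f(ku)\, d(ku) = \varphi\!\left(\frac{t}{k}\right).$$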


Theorem VI. The characteristic function for the normal distribution 
is $e^{-t^2/2}$. 


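This is proved by a direct computation from (1): 

$$\varphi(t) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{itz} e^{-z^2/2}\, dz$$
$$= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(z^2 - 2itz)}\, dz$$
$$= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(z^2 - 2itz - t^2) - t^2/2}\, dz$$
$$= e^{-t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-(z - it)^2/2}\, dz$$
$$= e^{-t^2/2}.$$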
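
Theorem VI can also be checked by brute force, evaluating the defining integral (1) numerically. This is only a sanity check under our own assumptions: the interval of integration is cut off at $\pm 12$, the midpoint rule is used, and the function names are ours.

```python
import cmath
import math

def char_fn(density, t, lo=-12.0, hi=12.0, steps=20000):
    """Midpoint-rule approximation to the integral of e^(itx) f(x) dx over (lo, hi)."""
    h = (hi - lo) / steps
    total = 0j
    for j in range(steps):
        x = lo + (j + 0.5) * h
        total += cmath.exp(1j * t * x) * density(x)
    return total * h

def normal_density(z):
    """Density of the normal distribution with mean zero and variance unity."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 2.0):
        print(t, char_fn(normal_density, t), math.exp(-t * t / 2))
```

The imaginary part of the computed value is zero up to rounding, as it must be for a density symmetric about the origin.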


67. Application to the Central Limit Theorem 


Let us begin our discussion of general proofs of the central limit 
theorem by considering the case (Corollary 3 to Theorem I) of identical 
variables with second moments. Though this seems like a very 
special case, the student should see from the example on sample means 
that it is a very important one. Indeed, whatever we may be studying 
about samples (means or anything else), the set of sample values is a 
set of identical stochastic variables. 

So let $x_1, x_2, x_3, \ldots$ be independent and have identical 
distributions with second moments. As usual, we assume the common mean 
is zero. Instead of $b_k$ we shall now write $b$ inasmuch as the $x_k$'s all 
have the same variance. Furthermore, we let $\varphi(t)$ be the characteristic 
function for $x_k$, also the same for all $k$'s. If we compute the first 
two derivatives of $\varphi$ (differentiating with respect to $t$ under the integral 
sign), we have 

$$\varphi'(t) = \int_{-\infty}^{\infty} ix\, e^{itx} f(x)\, dx,$$
$$\varphi''(t) = \int_{-\infty}^{\infty} -x^2 e^{itx} f(x)\, dx.$$


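The existence of the first two moments and the boundedness of $e^{itx}$ tell 
us that these integrals converge; so $\varphi$ has two derivatives, and it can 
be shown that they are continuous. At $t = 0$, we get 
$e^{itx} = e^0 = 1$; 

$$\varphi(0) = \int_{-\infty}^{\infty} f(x)\, dx = 1,$$
$$\varphi'(0) = \int_{-\infty}^{\infty} ix f(x)\, dx = i\bar{x} = 0,$$
$$\varphi''(0) = \int_{-\infty}^{\infty} -x^2 f(x)\, dx = -b.$$

Therefore (Prob. 21, Chap. 7) 

$$\varphi(t) = 1 - \frac{b}{2}\, t^2 + o(t^2).$$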


In this problem, $B_n = nb$; 
so we want the characteristic function for $x_k/\sqrt{nb}$. By Theorem V, 
this is 

$$\varphi\!\left(\frac{t}{\sqrt{nb}}\right) = 1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right).$$

By Theorem IV, the characteristic function for $S_n$ is the $n$th power of 
this; 

$$\left[\varphi\!\left(\frac{t}{\sqrt{nb}}\right)\right]^n = \left[1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)\right]^n \to e^{-t^2/2}$$

by Prob. 25, Chap. 7. The application of Theorem VI and Theorem 
III now completes the proof of the case stated in Corollary 3 to 
Theorem I. 



This simple case illustrates the general idea, but we might as well 
look at a rough outline of the Liapounoff theorem. If the $x_k$'s are not 
identical, we have different characteristic functions; so we call them 
$\varphi_k(t)$. Proceeding as above, we get $\varphi_k(0) = 1$, $\varphi_k'(0) = i\bar{x}_k = 0$, and 
$\varphi_k''(0) = -b_k$. Thus, on dividing by $\sqrt{B_n}$, we have 

$$\varphi_k\!\left(\frac{t}{\sqrt{B_n}}\right) = 1 - \frac{b_k t^2}{2B_n} + \cdots.$$

So the leading term in the expansion of $\log \varphi_k$ reads 

$$\log \varphi_k\!\left(\frac{t}{\sqrt{B_n}}\right) = -\frac{b_k t^2}{2B_n} + \cdots.$$

For the characteristic function of $S_n$ we want 

$$\prod_{k=1}^{n} \varphi_k\!\left(\frac{t}{\sqrt{B_n}}\right);$$

so for the logarithms we want 

$$\sum_{k=1}^{n} \log \varphi_k\!\left(\frac{t}{\sqrt{B_n}}\right) = -\sum_{k=1}^{n} \frac{b_k t^2}{2B_n} + \cdots = -\frac{t^2}{2} + \cdots$$

because by Theorem V, Chap. 6, $B_n = \sum b_k$. 

Again, we have the right answer provided that the error terms 
(carefully omitted above) behave right. The details of that argument are 
only tedious, but we might indicate what is called for to make the 
errors behave. If we assume the existence of third moments for the 
$x_k$'s, we get third derivatives for $\varphi_k$; and the Maclaurin formula with 
remainder gives the error in the expansion of $\varphi_k(t/\sqrt{B_n})$ as $\theta c_k t^3/6B_n^{3/2}$, 
where $|\theta| < 1$ and $c_k = E(|x_k|^3)$. By some routine computation and 
an inequality on moments due to Liapounoff, we can get an estimate of 
the same order of magnitude for the error in the logarithm of $\varphi_k$. Now 
the total error is the sum of these, and Liapounoff's hypothesis is just 
what we should expect: that this sum tends to zero. It might be well 
to state this theorem formally. 


Theorem VII (Laplace-Liapounoff). If $x_1, x_2, x_3, \ldots$ is a sequence 
of independent stochastic variables, each having an absolute third 
moment $c_k$, and if 

$$\lim_{n\to\infty} \frac{1}{B_n^{3/2}} \sum_{k=1}^{n} c_k = 0,$$

then 

$$\lim_{n\to\infty} \operatorname{pr}\{a < S_n < b\} = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz.$$


We have outlined here the proofs of two forms of the central limit 
theorem, that stated in Corollary 3 to Theorem I and that stated just 
now in Theorem VII. Though the method of attack is roughly the 
same for the two cases and though the first form seems much simpler, 
the student must not get the impression that it is a special case of 
Theorem VII. In the case of the identical variables we needed only 
second moments, whereas in the more general case that is not sufficient. 
The form of the central limit theorem given in Corollary 3 to Theorem I 
follows as a corollary from the Lindeberg theorem, but not from that of 
Liapounoff. 


68. Example—Cauchy’s Distribution 


Cauchy's example of the failure of the central limit law consists of a 
sequence of independent variables, each having the density function 

$$f(x) = \frac{1}{\pi(1 + x^2)}.$$

It turns out that the characteristic function for this distribution is 

$$\varphi(t) = e^{-|t|}.$$


A direct calculation of this by (1), Sec. 66, calls for analytical 
techniques beyond the scope of this book, but we can at least indicate that 
this is the case by applying the inversion formula to $\varphi(t)$ and seeing 
that we get $f(x)$: 

$$\frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx}\, \varphi(t)\, dt = \frac{1}{2\pi} \int_{-\infty}^{0} e^{t - itx}\, dt + \frac{1}{2\pi} \int_{0}^{\infty} e^{-t - itx}\, dt$$
$$= \frac{1}{2\pi} \left[\frac{e^{(1 - ix)t}}{1 - ix}\right]_{-\infty}^{0} + \frac{1}{2\pi} \left[-\frac{e^{-(1 + ix)t}}{1 + ix}\right]_{0}^{\infty}$$
$$= \frac{1}{2\pi} \left(\frac{1}{1 - ix} + \frac{1}{1 + ix}\right)$$
$$= \frac{1}{\pi(1 + x^2)}.$$

Now, by Theorem IV, the characteristic function for 

$$\sum_{k=1}^{n} x_k$$

is 

$$[\varphi(t)]^n = e^{-n|t|}.$$

Therefore, by Theorem V, the characteristic function for 

$$\frac{1}{n} \sum_{k=1}^{n} x_k$$

is 

$$e^{-n|t/n|} = e^{-|t|},$$

the same as that for each of the $x_k$'s. So the arithmetic mean of the 
first $n$ variables has the same distribution as that of the individual 
variables. 
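
This conclusion is striking enough to deserve an experimental check. The sketch below is ours and rests on stated assumptions: a Cauchy variable is generated as the tangent of an angle uniformly distributed on $(-\pi/2, \pi/2)$, and the seed, trial count, and function names are arbitrary. Since $\operatorname{pr}\{-1 < x < 1\} = 1/2$ for the Cauchy density, the mean of $n$ such variables should fall in $(-1, 1)$ about half the time no matter how large $n$ is.

```python
import math
import random

def cauchy_sample(rng):
    """One value with density 1/(pi(1 + x^2)): the tangent of a uniform angle."""
    return math.tan(math.pi * (rng.random() - 0.5))

def fraction_of_means_in_unit_interval(n, trials, seed=0):
    """Relative frequency with which the mean of n Cauchy variables lies in (-1, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(cauchy_sample(rng) for _ in range(n)) / n
        if -1.0 < mean < 1.0:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, fraction_of_means_in_unit_interval(n, trials=4000))
```

For variables with a second moment the corresponding frequency would tend to 1 as $n$ grows; here it stays near 1/2.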


As we noted in Sec. 42, these variables do not have second moments; 
they do not even have first moments. So the classical formulation 
of the central limit law is impossible. However, the striking thing 
about this example is that no averaging process leads to a normal 
distribution. Let $\{\alpha_n\}$ be any sequence of constants. Then the density 
function for 

$$\frac{1}{\alpha_n} \sum_{k=1}^{n} x_k$$

is 

$$\frac{\alpha_n}{n\pi[1 + (\alpha_n x/n)^2]},$$

and under no circumstances does this tend to $(2\pi)^{-1/2} e^{-x^2/2}$. 


69. Bernoullian Case—Detailed Proof 


If the variables $x_k$ represent a Bernoullian sequence of trials, then 
(see Sec. 48) 

$$B_n = \operatorname{var}\left(\sum_{k=1}^{n} x_k\right) = npq.$$

If, as usual, $r$ represents the number of successes in $n$ trials, 

$$r = \sum_{k=1}^{n} x_k,$$

then we have seen that 

$$E(r) = np.$$

Therefore, the variables $X_n$ and $S_n$ of this chapter are 

$$X_n = r - np,$$
$$S_n = (r - np)/\sqrt{npq}.$$


So this special case of the central limit theorem reads as follows: 


Theorem VIII. For a Bernoullian sequence of trials, 

$$\operatorname{pr}\left\{a < \frac{r - np}{\sqrt{npq}} < b\right\} = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz + o(1).$$


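The usual proof of this theorem is very straightforward. The 
probability function for $r$ is well known: 

$$f(r) = {}^{n}C_{r}\, p^r q^{n-r}.$$

The Stirling theorem approximation to the factorials in ${}^{n}C_{r}$ and the 
obvious changes of variable lead us directly to the result. Let 

$$z = \frac{r - np}{\sqrt{npq}};$$

then 

$$r = np + z\sqrt{npq}, \tag{1}$$
$$n - r = nq - z\sqrt{npq}. \tag{2}$$

If $z$ is bounded (and, indeed, we want $a < z < b$), then $r$ and $n - r$ 
both tend to $\infty$ as $n$ does. So the Stirling formula applies to all three 
factorials: 

$$f(r) = \frac{n^n e^{-n} \sqrt{2\pi n}\; p^r q^{n-r}}{r^r e^{-r} \sqrt{2\pi r}\, (n - r)^{n-r} e^{-(n-r)} \sqrt{2\pi(n - r)}}\, [1 + o(1)]$$
$$= \frac{1 + o(1)}{\left(\dfrac{r}{np}\right)^{r + 1/2} \left(\dfrac{n - r}{nq}\right)^{n - r + 1/2} \sqrt{2\pi npq}};$$

$$\log f(r) = -\log\sqrt{2\pi npq} - \left(r + \frac{1}{2}\right) \log\left(\frac{r}{np}\right) - \left(n - r + \frac{1}{2}\right) \log\left(\frac{n - r}{nq}\right) + o(1).$$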


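Now, from (1) and (2), respectively, we have 

$$\frac{r}{np} = 1 + z\sqrt{\frac{q}{np}},$$
$$\frac{n - r}{nq} = 1 - z\sqrt{\frac{p}{nq}}.$$

Therefore, 

$$\log f(r) = -\log\sqrt{2\pi npq} - \left(np + z\sqrt{npq} + \frac{1}{2}\right) \log\left(1 + z\sqrt{\frac{q}{np}}\right)$$
$$- \left(nq - z\sqrt{npq} + \frac{1}{2}\right) \log\left(1 - z\sqrt{\frac{p}{nq}}\right) + o(1).$$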


Our next step is to expand these last two logarithms in series. Here we have direct evidence of the significance of the size of $p$ and $q$. The series expansion

$$\log(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots$$

is convergent only for $|x| < 1$ and not very rapidly convergent unless $|x|$ is very small. Now, our variables, $z(q/np)^{1/2}$ and $-z(p/nq)^{1/2}$, are small for $n$ sufficiently large; but if either $p$ or $q$ is very small, there should be a better approximation for intermediate values of $n$. This is the role played by the Poisson formula (see Chap. 10).

Our present concern is with the limit theorem, however; and to this end we note that, for $n$ sufficiently large, the series expansions are valid and we have

$$\log f(r) = -\log\sqrt{2\pi npq} - \left(np + z\sqrt{npq} + \tfrac12\right)\left[z\sqrt{\frac{q}{np}} - \frac{z^2 q}{2np} + O(n^{-3/2})\right] - \left(nq - z\sqrt{npq} + \tfrac12\right)\left[-z\sqrt{\frac{p}{nq}} - \frac{z^2 p}{2nq} + O(n^{-3/2})\right] + o(1)$$

$$= -\log\sqrt{2\pi npq} - \frac{z^2}{2} + o(1).$$

$$(3)\qquad f(r) = \frac{1}{\sqrt{2\pi npq}}\,e^{-z^2/2}\,[1 + o(1)].$$
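The local approximation (3) can be checked numerically. The sketch below (Python; the function names and sample values are illustrative assumptions, not from the text) computes the logarithmic error, that is, the logarithm of the exact probability minus the logarithm of the approximation, at a point roughly one standard deviation above the mean.

```python
import math

def log_binom_pmf(n, r, p):
    """log of the exact probability C(n, r) p^r q^(n-r), for 0 < p < 1."""
    q = 1.0 - p
    return (math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
            + r * math.log(p) + (n - r) * math.log(q))

def log_normal_local(n, r, p):
    """log of the approximation (3): exp(-z^2/2) / sqrt(2 pi n p q)."""
    q = 1.0 - p
    z = (r - n * p) / math.sqrt(n * p * q)
    return -0.5 * math.log(2.0 * math.pi * n * p * q) - z * z / 2.0

def log_error(n, r, p):
    """Logarithmic error of (3) at a single value of r."""
    return log_binom_pmf(n, r, p) - log_normal_local(n, r, p)

if __name__ == "__main__":
    # r is taken near np + sqrt(npq), so that z is close to 1 in each case.
    for n, r in ((100, 35), (10000, 3046)):
        print(n, r, log_error(n, r, 0.3))
```

For these sample values the error shrinks as n grows, which is all that the o(1) in (3) asserts; how fast it shrinks is the subject of the discussion of error terms later in the section.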


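As $r$ runs through integer values, the increments in $z$ are $(npq)^{-1/2}$; therefore we could write

$$f(r) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,\Delta z\,[1 + o(1)].$$

Now, as $n \to \infty$, $\Delta z \to 0$; so

$$\operatorname{pr}\{a < z < b\} = \sum_{z > a}^{z < b} \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,\Delta z\,[1 + o(1)] = \frac{1}{\sqrt{2\pi}}\int_a^b e^{-z^2/2}\,dz + o(1) + \sum o(1)\,e^{-z^2/2}\,\Delta z.$$

The error term in (3) depends on $z$, but for $a < z < b$ it is bounded, and its upper bound tends to zero as $n \to \infty$; therefore we can write

$$\left|\sum e^{-z^2/2}\,\Delta z\;o(1)\right| \le \left[\sum e^{-z^2/2}\,\Delta z\right] o(1) = o(1),$$

whence

$$(4)\qquad \operatorname{pr}\{a < z < b\} = \frac{1}{\sqrt{2\pi}}\int_a^b e^{-z^2/2}\,dz + o(1).$$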


This proves the limit theorem. It also furnishes us with the proof promised in Chap. 7 that the constant in Stirling's theorem is $\sqrt{2\pi}$. The logarithm of the coefficient of $e^{-z^2/2}$ in (4) is just the negative of the constant $C_2$ of (3), Sec. 53. To make the integral of the density function in (4) unity, we must have this coefficient equal to $1/\sqrt{2\pi}$; therefore $C_2 = \log\sqrt{2\pi}$.
The question of the error term in (4) is a rather complicated one. The proof we have given of Theorem VIII (basically that of de Moivre) does not lead to anything very interesting in the way of an estimate of this error. Uspensky (Introduction to Mathematical Probability, Chap. VII) has adapted Laplace's proof of Theorem VIII to get such an estimate. Rather than attempt any such procedure here, let us turn our attention to (3) and the error involved there. Equation (3) gives an approximation to the probability for a single value of $r$. The Poisson formula (Chap. 10) gives another approximation to the same thing. So the most direct comparison of the normal law with the Poisson law is given by comparing the error in (3) with the error term in Theorem III, Chap. 10.

Feller has pointed out that (3) does not give the best first-order approximation to $f(r)$. The form of (3) is, of course, suggested by the classical formulation of the central limit law. What Feller suggests is that instead of the variable

$$z = \frac{r - np}{\sqrt{npq}}$$

we consider the variable

$$u = \frac{r - (n+1)p + \tfrac12}{\sqrt{(n+1)pq}}.$$

This leads to the same limit theorem, but the error in the approximation

$$f(r) \sim \frac{e^{-u^2/2}}{\sqrt{2\pi(n+1)pq}}$$

is much neater than that in (3). We present here a slight modification of Feller's argument:


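Theorem IX. The probability of $r$ successes in $n$ Bernoullian trials is approximately

$$\frac{1}{\sqrt{2\pi(n+1)pq}}\,e^{-u^2/2}.$$

The ratio of the correct result to this approximation is

$$\exp\left[\frac{(q-p)u^3}{6\sqrt{(n+1)pq}} - \frac{(p^3+q^3)u^4}{3(n+1)pq} + \frac{u^4 Q}{4(n+1)pq} + R\right],$$

where

$$u = \frac{r + \tfrac12 - (n+1)p}{\sqrt{(n+1)pq}},$$

$$Q = \frac{p^3\left(1 - u\sqrt{\dfrac{p}{(n+1)q}}\right)}{\left(1 - \theta_2 u\sqrt{\dfrac{p}{(n+1)q}}\right)^4} + \frac{q^3\left(1 + u\sqrt{\dfrac{q}{(n+1)p}}\right)}{\left(1 + \theta_1 u\sqrt{\dfrac{q}{(n+1)p}}\right)^4},$$

$$0 < \theta_1 < 1, \qquad 0 < \theta_2 < 1,$$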


and 


|R| i 


L 
h +F 
24n + 1)pg + ulg — p) Vin + pq + upg] 12” + 


The proof of this theorem parallels very closely the proof we have given for Theorem VIII. Instead of giving all the details of computation here, let us merely refer back to the steps in the proof of Theorem VIII and indicate the changes to be made. In place of (1) and (2), we have

$$(5)\qquad r + \tfrac12 = (n+1)p + u\sqrt{(n+1)pq} = (n+1)p\left(1 + u\sqrt{\frac{q}{(n+1)p}}\right),$$

$$(6)\qquad n - r + \tfrac12 = (n+1)q - u\sqrt{(n+1)pq} = (n+1)q\left(1 - u\sqrt{\frac{p}{(n+1)q}}\right).$$

Before applying Stirling's formula to $f(r)$, we multiply and divide by $n + 1$ to get

$$f(r) = \frac{(n+1)!\,p^r q^{n-r}}{(n+1)\,r!\,(n-r)!}.$$

Now, for $(n+1)!$ we use Stirling's formula as given by Theorem I, Chap. 7. For $r!$ and $(n-r)!$ we use the form given in Prob. 9, Chap. 7. The resulting approximation to $f(r)$ then simplifies to

$$f(r) \sim \frac{1}{\sqrt{2\pi(n+1)pq}\left[\dfrac{r+\tfrac12}{(n+1)p}\right]^{r+1/2}\left[\dfrac{n-r+\tfrac12}{(n+1)q}\right]^{n-r+1/2}}.$$

Now, we take logarithms and substitute from (5) and (6).

$$\log f(r) \sim -\log\sqrt{2\pi(n+1)pq} - (n+1)p\left(1 + u\sqrt{\frac{q}{(n+1)p}}\right)\log\left(1 + u\sqrt{\frac{q}{(n+1)p}}\right) - (n+1)q\left(1 - u\sqrt{\frac{p}{(n+1)q}}\right)\log\left(1 - u\sqrt{\frac{p}{(n+1)q}}\right).$$

At this stage, we can see why this seemingly more complicated form of the problem leads to a simpler and more efficient estimate of the error term. In the proof of Theorem VIII, we used a series expansion for something of the form $(1 + x + y)\log(1 + x)$. Here, it is simplified to the form $(1 + x)\log(1 + x)$. The Maclaurin formula with remainder gives

$$\log(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4(1 + \theta x)^4} \qquad (0 < \theta < 1).$$

Therefore

$$(7)\qquad (1 + x)\log(1 + x) = x + \frac{x^2}{2} - \frac{x^3}{6} + \frac{x^4}{3} - \frac{x^4(1 + x)}{4(1 + \theta x)^4}.$$


It is now only a matter of routine computation to complete the proof of Theorem IX. We use (7) twice, first with

$$x = u\sqrt{\frac{q}{(n+1)p}},$$

then with

$$x = -u\sqrt{\frac{p}{(n+1)q}}.$$

Then we multiply by $-(n+1)p$ and $-(n+1)q$, respectively, and add. The first-degree terms cancel, and the second-degree terms give us $-u^2/2$.

We return to this error term in Chap. 10, where we compare it with the Poisson approximation, but there is one point we should like to make here. The leading term is

$$\frac{(q-p)u^3}{6\sqrt{(n+1)pq}},$$

which means that in general for fixed (or at least bounded) $u$ the logarithmic error is $O(n^{-1/2})$. However, all the other terms in the exponent are $O(n^{-1})$; and for the one special case $p = q = \tfrac12$ the leading term vanishes. Therefore, for $p = \tfrac12$, not only is the logarithmic error smaller than otherwise; it is an infinitesimal of a higher order. The student should note this observation cuts two ways: (1) The central limit law approximation is best in the Bernoullian case for $p = \tfrac12$. (2) Estimates of the error for $p = \tfrac12$ are of no significance at all for other values of $p$; they are of the wrong order of magnitude.
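The remark about orders of magnitude can be seen numerically. The Python sketch below (function names and sample values are illustrative assumptions) computes the logarithmic error of the approximation $e^{-u^2/2}/\sqrt{2\pi(n+1)pq}$, with $u = (r + \tfrac12 - (n+1)p)/\sqrt{(n+1)pq}$, and compares it with the leading term $(q-p)u^3/(6\sqrt{(n+1)pq})$.

```python
import math

def log_binom_pmf(n, r, p):
    """log of the exact probability C(n, r) p^r q^(n-r), for 0 < p < 1."""
    q = 1.0 - p
    return (math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
            + r * math.log(p) + (n - r) * math.log(q))

def feller_u(n, r, p):
    """The variable u = (r + 1/2 - (n+1)p) / sqrt((n+1)pq)."""
    q = 1.0 - p
    return (r + 0.5 - (n + 1) * p) / math.sqrt((n + 1) * p * q)

def feller_log_error(n, r, p):
    """log(exact probability) - log(exp(-u^2/2) / sqrt(2 pi (n+1) p q))."""
    q = 1.0 - p
    u = feller_u(n, r, p)
    log_approx = -0.5 * math.log(2.0 * math.pi * (n + 1) * p * q) - u * u / 2.0
    return log_binom_pmf(n, r, p) - log_approx

def leading_term(n, r, p):
    """The term (q - p) u^3 / (6 sqrt((n+1)pq)), of order n^(-1/2)."""
    q = 1.0 - p
    u = feller_u(n, r, p)
    return (q - p) * u ** 3 / (6.0 * math.sqrt((n + 1) * p * q))

if __name__ == "__main__":
    # In both cases r is chosen so that u is close to 1.
    print("p = 0.3:", feller_log_error(1000, 314, 0.3), leading_term(1000, 314, 0.3))
    print("p = 0.5:", feller_log_error(1000, 516, 0.5), leading_term(1000, 516, 0.5))
```

For $p = 0.3$ the leading term accounts for nearly all of the error; for $p = \tfrac12$ it is zero and the remaining error is an order of magnitude smaller.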


REFERENCES FOR FURTHER STUDY 
Cramér, Mathematical Methods of Statistics, Chaps. 15-17.
Feller, "On the Normal Approximation to the Binomial Distribution," Ann. Math. Stat., vol. 16, pp. 319-329, 1945.
Levy and Roth, Elements of Probability, Chap. V.
Uspensky, Introduction to Mathematical Probability, Chaps. XIV, XV.


PROBLEMS 


1. Unit electrical charges are placed at the integer points on the positive half of the real line. A unit negative charge is placed at the origin. By the inverse square law, the force on the charge at the origin due to the charge at the point $k$ is $\pm 1/k^2$; the direction of the force depends on whether the charge at $k$ is positive or negative. The signs of the charges at the positive integer points are chosen independently with probability $\tfrac12$ for each sign at each point. The total force on the charge at the origin is the sum of the forces due to the individual charges, but the probability distribution for this total force is not normal. Show that the Lindeberg condition fails because of Corollary 1 to Theorem I.

2. Suppose the unit charges of random sign are placed, not at the integer points, but at the points $1/k$, other conditions the same as in Prob. 1. Show that Corollary 2 to Theorem I now applies to give a normal probability distribution for the total force.

3. Show that the Lindeberg condition is satisfied by the variables in Prob. 14, Chap. 8. It now follows (as noted there) that the central limit law holds for these variables but the weak law of large numbers does not.

4. In Prob. 12, Chap. 8, we asked the student to state the central limit law for each of three different sequences of variables. Verify by Corollary 3 to Theorem I that each of these sequences obeys the law.

5. In Chap. 8, we asked for a statement of the central limit law for a Poisson sequence of trials. The law does not always hold in this case. Use Corollary 2 to Theorem I to find conditions under which it does. Ans. $\sum p_k q_k = \infty$. (See also Probs. 21, 22 below.)

6. There is a sequence of urns $U_1, U_2, U_3, \ldots$. The urn $U_k$ contains one white ball and $k - 1$ black balls. One ball is drawn from $U_1$, one from $U_2$, etc. Show that the probability distribution for the number of white balls in $n$ draws is asymptotically normal.

7. With the sequence of urns in Prob. 6, 2 draws (with replacement) are made from each urn, and success for each urn is defined as 2 white balls. Show that the probability distribution for the number of successes is not asymptotically normal in this case.

8. From Prob. 4 above and Prob. 12, Chap. 8, it follows that, as the number $m$ of degrees of freedom increases, the distribution of

$$\frac{\chi^2 - m}{\sqrt{2m}}$$

tends to the normal. From this derive R. A. Fisher's substitute for the $\chi^2$ test when $m$ is large. According to Fisher, for large $m$,

$$\sqrt{2\chi^2} - \sqrt{2m - 1}$$

is approximately normally distributed with variance unity. Hint:

$$\sqrt{2\chi^2} < \sqrt{2m - 1} + z$$

is equivalent to

$$\chi^2 < m - \tfrac12 + z\sqrt{2m - 1} + \frac{z^2}{2}.$$

As $m \to \infty$, this last expression is asymptotic to $m + z\sqrt{2m}$; therefore

$$\operatorname{pr}\left\{\frac{\chi^2 - m}{\sqrt{2m}} < z\right\} \sim \operatorname{pr}\left\{\sqrt{2\chi^2} - \sqrt{2m - 1} < z\right\}.$$


9. Suppose the size of an organism is due to the effect of a large number of independent random impulses $x_1, x_2, \ldots, x_n$ acting on it in order during its lifetime. Suppose, however, that the effect of a given impulse depends not only on the magnitude of the impulse but on the size of the organism at the time of the impulse. In particular, let $Z_k$ be the size achieved after the $k$th impulse and assume

$$Z_k = Z_{k-1} + x_k Z_{k-1}.$$

Then,

$$\sum_{k=1}^{n} x_k = \sum_{k=1}^{n} \frac{Z_k - Z_{k-1}}{Z_{k-1}} \approx \int_{Z_0}^{Z_n} \frac{dZ}{Z}.$$

Assuming $Z_0 = 1$, show that the final size $Z_n$ of the organism should have an approximately logarithmiconormal distribution. Hint: See Prob. 24, Chap. 5.

10. Individual incomes (within certain ranges) follow a reasonable facsimile of a logarithmiconormal distribution. Devise an explanation for this modeled after the argument in Prob. 9.

11. By applying the central limit theorem to a sequence of variables each of which has a Poisson distribution, prove the following asymptotic
estimate for partial sums from the exponential series: For large œ, 


ksatiVa 


a Peale 543 2/2 
Rime Í EMA de. 
k>atiVe "v2 
Hint: In Prob. 8, Chap. 8, substitute a = na, k = na +2 v/na. 5 
12. In Chap. 4 we described a random walk in one dimension in terms of a sequence of stochastic variables. Show that the central limit theorem applies to these variables to give the result that after a large number $n$ of jumps

$$\operatorname{pr}\{a < x < b\} \sim \frac{1}{\sqrt{2\pi}}\int_{a/\sqrt{n}}^{b/\sqrt{n}} e^{-z^2/2}\,dz.$$

13. Find the effect of a translation on the characteristic function.

Ans. If $\varphi(t)$ is the characteristic function for $x$, that for $x - a$ is $e^{-iat}\varphi(t)$.

14. Let $x = z - \bar{z}$, where $z$ is the number of successes in a single Bernoullian trial. Show that the characteristic function for $x$ is

$$q e^{-ipt} + p e^{iqt} = 1 - \frac{pq\,t^2}{2} + O(t^3).$$

15. From the result in Prob. 14 show that the characteristic function for $S_n$ in a sequence of Bernoullian trials may be written

$$\left[1 - \frac{t^2}{2n} + O(n^{-3/2})\right]^{n}.$$


16. Apply Prob. 25, Chap. 7, to the result of Prob. 15 above to prove the central limit theorem for Bernoullian trials.

17. Let $x$ be chosen at random between $-\tfrac12$ and $\tfrac12$. Find the characteristic function for $x$. Ans. $\varphi(t) = (2/t)\sin(t/2)$.

18. Prove the central limit theorem for a sequence of independent variables, each distributed as in Prob. 17. Hint: Show that the characteristic function for $S_n$ is

$$\left[\frac{\sqrt{n}}{t\sqrt{3}}\sin\frac{t\sqrt{3}}{\sqrt{n}}\right]^{n} = \left[1 - \frac{t^2}{2n} + O(n^{-2})\right]^{n}.$$

19. Let $x$ have the density function $e^{-x}$ $(x > 0)$. Find the characteristic function for $x - \bar{x}$. Ans. $\varphi(t) = e^{-it}/(1 - it)$.

20. Prove the central limit theorem for a sequence of independent variables, each having the distribution given in Prob. 19.

21. Show that, if $x_k$ represents one of a Poisson sequence of trials, the absolute third moment of $x_k$ about its mean is $p_k q_k (p_k^2 + q_k^2) < p_k q_k$.

22. Show from Prob. 21 that, for a Poisson sequence of trials,

$$B_n^{-3/2}\sum_{k=1}^{n} c_k < \left(\sum_{k=1}^{n} p_k q_k\right)^{-1/2};$$

therefore the Liapounoff theorem gives the same result for this case as the Lindeberg theorem does. (Compare Prob. 5 above.)

23. Find the logarithmic error term of order $n^{-1/2}$ in the approximation (3), Sec. 69. Note that to get all terms of order $n^{-1/2}$, we must carry the Maclaurin expansions for the logarithms (immediately preceding (3)) one step farther than they are carried in the text.

Ans. $(q - p)(z^3 - 3z)/(6\sqrt{npq})$.

24. Compare the result in Prob. 23 with the error term of the same order in Theorem IX.
Fie ps Designating by e the answer to Prob. 23 and by 7 the error 
ax order n= in Theorem IX, we have, for z = 1, |n| = |e|/2; for 
lel 6 Kilu, However, forz = 4/3, |e] = 0; and, for (roughly) 
2) lel < Jnl. 
25. Noting the signs of the first two error terms in Theorem IX, show that, if $p > \tfrac12$, the approximation is better for $u < 0$, while if $p < \tfrac12$, it is better for $u > 0$.

26. Show that if $.1 < p < .9$ and $n > 1{,}000$, the logarithmic error in Theorem IX is given correct to two decimal places by the first term.

27. Find the probability that in 2,500 Bernoullian trials with $p = \tfrac12$ the experimental ratio $r/n$ will be within .02 of $p$. Answer the same question if $p = \tfrac13$. Ans. .9544, .967.

28. In 2,500 Bernoullian trials with $p = \tfrac12$ what is the best upper bound on $|(r/n) - p|$ that can be obtained with probability .95? with probability .68? Ans. .02, .01.

29. Answer the questions in Prob. 28 for the case $p = \tfrac13$.

Ans. .019, .0095.

30. In a Bernoullian sequence of trials with $p = \tfrac12$ how many trials does it take to have $|(r/n) - p| < .01$ with probability .95? with probability .68? Ans. 10,000; 2,500.

31. Answer the questions in Prob. 30 for the case $p = \tfrac13$.

Ans. 8,800, 2,200.


CHAPTER 10
THE POISSON DISTRIBUTION


In this chapter we want to present two important interpretations of the Poisson probability function $a^r e^{-a}/r!$. One is that it furnishes an approximation to Bernoullian probabilities for the case of small $p$. This is the classical interpretation, and we shall investigate it at some length, determining how good an approximation the Poisson formula furnishes and comparing it with the normal law approximation (Theorem IX, Chap. 9). However, in the modern theory of probability there is another interpretation in which the Poisson distribution appears as the exact solution to a certain general type of problem. To get a representative picture, the student should see this point of view also; and it is presented in Theorem IV.

The examples in the chapter and those among the problems at the end give the student some idea of the importance of the Poisson formula, but we might add that there are other topics in the modern theory of probability not even mentioned in this book in which the Poisson formula plays an important role.


70. The Binomial Limit


As we indicated in Chap. 8, the Poisson probability function appears as the limit of the sequence of pseudo-Bernoullian probability functions (1), Sec. 57. That fact and a slight generalization of it can serve as the basis of the developments we want to present in this chapter; so we first turn our attention to the proof that the sequence (1), Sec. 57, obeys the Poisson limit law.

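Theorem I. If $a$ is a constant and $r$ is a fixed positive integer,

$$\lim_{n\to\infty} {}^{n}C_{r}\left(\frac{a}{n}\right)^{r}\left(1 - \frac{a}{n}\right)^{n-r} = \frac{a^r e^{-a}}{r!}.$$

All this requires is straightforward computation:

$${}^{n}C_{r}\left(\frac{a}{n}\right)^{r}\left(1 - \frac{a}{n}\right)^{n-r} = \frac{n!\,a^r\left(1 - \dfrac{a}{n}\right)^{n}}{r!\,(n-r)!\,n^r\left(1 - \dfrac{a}{n}\right)^{r}} = \frac{a^r}{r!}\left(1 - \frac{a}{n}\right)^{n}\,\frac{1\left(1 - \dfrac{1}{n}\right)\left(1 - \dfrac{2}{n}\right)\cdots\left(1 - \dfrac{r-1}{n}\right)}{\left(1 - \dfrac{a}{n}\right)^{r}}.$$

Now,

$$\lim_{n\to\infty}\left(1 - \frac{a}{n}\right)^{n} = e^{-a},$$

and in the second fraction there are a fixed number $r$ of factors in numerator and denominator with each factor tending to unity. Therefore, this entire fraction tends to unity, and the theorem is proved.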
In order to present the fundamental steps with a minimum of confusion, we assumed in Theorem I that $a$ was a constant. Actually, the same result (with the same proof) holds if we replace $a$ by a variable which tends to $a$ as $n \to \infty$. In other words, in place of $a$, write $a + o(1)$; in place of $a/n$, write $(a/n) + o(1/n)$. We leave it to the student to check that this does not affect the above argument, and we list for reference the following:


Theorem II. As $n \to \infty$,

$${}^{n}C_{r}\left[\frac{a}{n} + o\!\left(\frac{1}{n}\right)\right]^{r}\left[1 - \frac{a}{n} + o\!\left(\frac{1}{n}\right)\right]^{n-r} = \frac{a^r e^{-a}}{r!} + o(1).$$


71. Approximation to Bernoullian Probabilities 


It is a good idea to have an estimate of the error in any approximation formula; but this seems particularly desirable here, because the Poisson formula is used as


e same for the Poisson law. 


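Theorem III. If $r$ is a positive integer and $a > 0$, then

$${}^{n}C_{r}\left(\frac{a}{n}\right)^{r}\left(1 - \frac{a}{n}\right)^{n-r} = \frac{a^r e^{-a}}{r!}\exp\left[\frac{r - (a - r)^2}{2n} + \frac{r(a^2 - r^2)}{2n^2} + R\right],$$

where

$$|R| < \frac{r^2(4r + 3)}{12(n - r)^2} + \frac{a^3(n - r)}{3(n - a)^3}.$$

Let us write

$$T = \frac{{}^{n}C_{r}\left(\dfrac{a}{n}\right)^{r}\left(1 - \dfrac{a}{n}\right)^{n-r}}{a^r e^{-a}/r!}.$$

Then,

$$T = \frac{n!\left(1 - \dfrac{a}{n}\right)^{n-r} e^{a}}{n^{r}\,(n - r)!}.$$

On applying Stirling's formula, we have

$$T = \frac{n^{n+1/2}\,e^{-n+\epsilon_n}\left(1 - \dfrac{a}{n}\right)^{n-r} e^{a}}{n^{r}\,(n - r)^{n-r+1/2}\,e^{-(n-r)+\epsilon_{n-r}}};$$

therefore

$$(1)\qquad \log T = a - r + (n - r)\log\left(1 - \frac{a}{n}\right) - (n - r)\log\left(1 - \frac{r}{n}\right) - \frac12\log\left(1 - \frac{r}{n}\right) - \epsilon,$$

where

$$\epsilon = \epsilon_{n-r} - \epsilon_n.$$

We use (1), Sec. 53, taking the larger estimate for $\epsilon_{n-r}$ and the smaller estimate for $\epsilon_n$, and have

$$(2)\qquad 0 < \epsilon < \frac{1}{12(n - r)} - \frac{1}{12n + 1} = \frac{12r + 1}{12(n - r)(12n + 1)}.$$

Now we expand the logarithms in series. In each case we have something of the form $\log(1 - x)$, where $0 < x < 1$. The Maclaurin formula with remainder tells us that, if we carry such an expansion through the term in $x^{k-1}$, the remainder is

$$R_k = \frac{x^k}{k!}\left[\frac{d^k}{dx^k}\log(1 - x)\right]_{x=\xi} = -\frac{x^k}{k(1 - \xi)^k},$$

where $0 < \xi < x$. Clearly, such remainders are negative and less in absolute value than

$$\frac{x^k}{k(1 - x)^k}.$$

Bearing this in mind we write

$$(3)\qquad (n - r)\log\left(1 - \frac{a}{n}\right) = (n - r)\left(-\frac{a}{n} - \frac{a^2}{2n^2}\right) - s = -a - \frac{a^2 - 2ar}{2n} + \frac{a^2 r}{2n^2} - s,$$

where

$$(4)\qquad 0 < s < \frac{(n - r)a^3}{3n^3\left(1 - \dfrac{a}{n}\right)^3} = \frac{a^3(n - r)}{3(n - a)^3}.$$

Similarly,

$$(5)\qquad -(n - r)\log\left(1 - \frac{r}{n}\right) = (n - r)\left(\frac{r}{n} + \frac{r^2}{2n^2}\right) + \sigma = r - \frac{r^2}{2n} - \frac{r^3}{2n^2} + \sigma,$$

where

$$(6)\qquad 0 < \sigma < \frac{(n - r)r^3}{3n^3\left(1 - \dfrac{r}{n}\right)^3} = \frac{r^3}{3(n - r)^2}.$$

Finally,

$$(7)\qquad -\frac12\log\left(1 - \frac{r}{n}\right) = \frac{r}{2n} + \tau,$$

where

$$(8)\qquad 0 < \tau < \frac{r^2}{4n^2\left(1 - \dfrac{r}{n}\right)^2} = \frac{r^2}{4(n - r)^2}.$$

Substituting (3), (5), and (7) in (1), we have

$$\log T = \frac{r - (a - r)^2}{2n} + \frac{r(a^2 - r^2)}{2n^2} + R, \qquad R = \sigma + \tau - \epsilon - s.$$

This completes the proof except for the estimate of $R$. It is easily seen that

$$\frac{12r + 1}{12(n - r)(12n + 1)} < \frac{r^2(4r + 3)}{12(n - r)^2};$$

so by (2)

$$0 < \epsilon < \frac{r^2(4r + 3)}{12(n - r)^2}.$$

However, from (6) and (8) we have

$$0 < \sigma + \tau < \frac{r^3}{3(n - r)^2} + \frac{r^2}{4(n - r)^2} = \frac{r^2(4r + 3)}{12(n - r)^2}.$$

Therefore,

$$|\sigma + \tau - \epsilon| < \frac{r^2(4r + 3)}{12(n - r)^2}.$$

Finally, we bring in (4) and have

$$|R| = |\sigma + \tau - \epsilon - s| \le |\sigma + \tau - \epsilon| + |s| < \frac{r^2(4r + 3)}{12(n - r)^2} + \frac{a^3(n - r)}{3(n - a)^3}.$$

This completes the proof of Theorem III.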
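A numerical check of the two explicit terms of the exponent may be helpful. The Python sketch below assumes the exponent is read as $[r - (a - r)^2]/2n + r(a^2 - r^2)/2n^2 + R$; the function names and the sample values $n = 1{,}000$, $p = .005$, $r = 7$ are illustrative, not from the text. It computes the logarithm of the ratio of the Bernoullian probability to the Poisson probability and subtracts the two explicit terms, leaving a numerical value for $R$.

```python
import math

def log_binom_pmf(n, r, p):
    """log of C(n, r) p^r q^(n-r), for 0 < p < 1."""
    q = 1.0 - p
    return (math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
            + r * math.log(p) + (n - r) * math.log(q))

def log_poisson_pmf(r, a):
    """log of a^r e^(-a) / r!."""
    return r * math.log(a) - a - math.lgamma(r + 1)

def explicit_terms(n, r, a):
    """[r - (a - r)^2]/(2n) + r(a^2 - r^2)/(2n^2), the exponent without R."""
    return (r - (a - r) ** 2) / (2.0 * n) + r * (a * a - r * r) / (2.0 * n * n)

def remainder(n, r, p):
    """Numerical value of R when a = np."""
    a = n * p
    log_ratio = log_binom_pmf(n, r, p) - log_poisson_pmf(r, a)
    return log_ratio - explicit_terms(n, r, a)

if __name__ == "__main__":
    n, p, r = 1000, 0.005, 7
    print("explicit terms:", explicit_terms(n, r, n * p))
    print("remainder R:   ", remainder(n, r, p))
```

For these values the explicit terms come to about .0014 and the computed remainder is smaller by more than a factor of ten.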

Corollary 1. The probability of $r$ successes in $n$ Bernoullian trials is approximately

$$\frac{(np)^r e^{-np}}{r!}.$$

The ratio of the correct result to this approximation is

$$\exp\left[\frac{r - (r - np)^2}{2n} + \frac{r(n^2p^2 - r^2)}{2n^2} + R\right],$$

where

$$|R| < \frac{r^2(4r + 3)}{12(n - r)^2} + \frac{n - r}{3}\left(\frac{p}{q}\right)^3.$$

This follows immediately from Theorem III by substituting $a = np$.


72. Comparison with the Normal Law


The estimates of error given in Theorem IX, Chap. 9, and Corollary 1 above enable us to tell, for any given problem, how accurate each approximation is. So if we have a problem in which there is some doubt as to which formula we should use, the best thing to do is to compute the error for each one and see which is the more accurate. However, it is a good idea to have in mind a general description of the type of problem to which each formula is applicable. The purpose of this section is to give the student some such general description.

In each case we have described the error by giving the ratio of the correct result to the approximation in the form of an exponential. This is a convenient form for reading off relative errors. The approximation

$$e^x \approx 1 + x$$

indicates that the exponent should give us a good idea of the relative error. For numbers that we should expect to find in error terms, this approximation is really quite satisfactory. For instance, $e^{.1} = 1.105$, $e^{.05} = 1.051$, $e^{.01} = 1.01005$. So, in studying these error terms, we shall compute the exponent and call that the relative error.

For any Bernoullian sequence the normal law approximation gets better as $n$ increases; so the natural way to describe its applicability is to find a minimum $n$ for a given degree of accuracy in each situation. The accuracy of the normal law depends on the values of $u$ and $p$ as well as on the value of $n$. It is natural to speak of the value of $u$ rather than that of $r$ (though, of course, we have to compute $u$ from $r$, $n$, and $p$) because the tables of values of the normal distribution function are always in terms of $u$. We include here a few computations from the error term in Theorem IX, Chap. 9, for $u = 1$ and $u = 2$, that is, for $r$ one and two standard deviations from its mean. Values of $p$ are given at the left. The desired degree of accuracy is given at the top. The entries in the table are (rounded off) minimum values of $n$.


TABLE I. NUMBER OF TRIALS REQUIRED FOR GIVEN ACCURACY OF NORMAL LAW
(u = 1)

  p        10%        5%         1%          0.1%

 .5          2         4         20           200
 .4          3         5         50         5,000
 .3          4        10        225        22,500
 .2          7        25        650        65,000
 .1         20        80      2,000       200,000
 .05        50       200      5,000       500,000
 .01       300     1,200     30,000     3,000,000


7 AW 
Taste II. NUMBER or TRIALS REQUIRED ror Given Accuracy or NORMAL L 


(u = 2) 
Se ee ee 


4 10% 5% 1% | 0.1% 
3 EM 60 300 3,000 
3 pa 140 3,200 320,000 
2 150 600 15,000 1,500,000 
a 400 | 1,600 40,000 4,000 ,000 
os | 320 | 5,200 130,000 | 13,000,000 
Di 200 | 13,000 320,000 | 32,000,000 
: 20,000 80,000 2,000,000 200 ,000 ,000 


In the case of the Poisson law, the situation is somewhat different. This is not an asymptotic formula for Bernoullian probabilities; so there is a maximum $n$ in each case as well as a minimum. In addition to the limitation imposed by the question of accuracy, there is the practical consideration of the usefulness of the formula itself. If $np$ is too large, values of $r$ near the mean will be too large for the formula to be practical. There is an excellent set of tables of values of the Poisson approximation in Molina's Poisson's Exponential Binomial Limit, New York (1942). These tables are for values of $a$ $(= np)$ from .001 to 100. Let us take this as the range of usefulness of the formula. There still remains the question of accuracy; and this depends on $n$, $p$, and the deviation $|r - np|$. Tables III and IV give

Ta 
BL 7 SON r 
aI, Numper or Trias Requirep ror GIVEN Accuracy or Porsson Law 


(lr — np| = 5) 


i 10% 1% 0.1% 
Se eee 

a 150-200 

05 125-1,000 | 300-700 — — 
01 125+ 250+ 1, 250+ = 
-001 1254+ 250+ 1,250+ 12,500 + 
-000 * * 1,250+ 12,500+ 
00001 * * 1,250+ | 12,500+ 


T. 
AB 
Ty, Numen or Trrars Requirep ror Given Accuracy or Porsson Law 


(r — npl = 10) 


4 10% 5% 1% 0.1% 
Ey P 
es 600-1 ,000 
pe 500+ 1,000+ 6,000 + = 
inor 500+ 1000+ | 5,000+ | 50,000+ 
sooo * 1000+ | 5,000+ | 50,000+ 
00001 * * 5,000+ | 50,000+ 


Oe E a Dewar Bc coat 

estimates from Corollary 1 to Theorem III in accordance with the following conventions: In some cases the indicated degree of accuracy cannot be attained; we have marked these with a dash (—). In some cases there are both upper and lower limits for $n$ within the range of Molina's tables; for these we give both limits. In some cases there is a lower limit that should be noted while the upper limit is beyond the range of Molina's tables; for these we give the lower limit followed


by a plus (+). Finally, in some cases the indicated degree of accuracy 
is guaranteed for all applicable entries in Molina’s tables; we have 
marked these situations with an asterisk (*). 


73. Example—Telephone Trunk Lines 


Suppose a large corporation is planning to have a telephone system installed. They will have a fairly large number $n$ of individual telephones, but each will connect into the company switchboard, and not nearly so many outside lines will be required. If data are available on the average proportion of the time each telephone is in use for outside calls, we can make a good guess as to the number of trunk lines needed for efficient service. Let $p$ be the probability that an individual telephone is using an outside line at a given time. To estimate such a probability, we set $60p$ equal to the average number of minutes per hour per telephone spent on outside calls.

Now, we think of the $n$ phones as $n$ independent "trials." Each trial consists of the question of whether or not a certain telephone requires an outside line at a certain time. The variable $r$ (which we ordinarily think of as the number of "successes") then represents the number of trunk lines in demand at a given time. The service should be reasonably efficient if the probability is fairly large (say .95) that the number $r$ of lines in demand does not exceed the number available. That is, if $k$ is the number of lines to be installed, we want

$$\operatorname{pr}\{r \le k\} > .95.$$

To estimate this probability from the normal law, we note that $r \le k$ means

$$\frac{r - np}{\sqrt{npq}} \le \frac{k - np}{\sqrt{npq}},$$

and the probability of this latter inequality is approximately

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{(k - np)/\sqrt{npq}} e^{-z^2/2}\,dz.$$

According to a table of the normal distribution, this integral is .95 if the upper limit is 1.65. Therefore, we get approximately 95 per cent efficiency if the number of trunk lines is

$$k = np + 1.65\sqrt{npq}.$$

In general, $p$ is rather small in this problem; so we could hope to get a reasonable estimate from the Poisson formula. This procedure involves choosing $k$ so that

$$\sum_{r=k+1}^{\infty} \frac{(np)^r e^{-np}}{r!} \le .05.$$

Telephone engineers prefer this solution to the problem because it involves less computation. Molina's tables include a section on cumulative probabilities such as the one above, and the required value of $k$ can be read off directly from the tables.
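In place of a table lookup, the two estimates can be computed directly. The Python sketch below is only an illustration: the function names and the sample figures (200 telephones, each on an outside line 3 minutes per hour, so $p = .05$) are assumptions, not data from the text.

```python
import math

def trunk_lines_normal(n, p, z=1.65):
    """Normal law estimate k = np + z sqrt(npq); z = 1.65 corresponds to .95."""
    q = 1.0 - p
    return n * p + z * math.sqrt(n * p * q)

def trunk_lines_poisson(n, p, level=0.95):
    """Smallest k whose Poisson tail beyond k, with a = np, is at most 1 - level."""
    a = n * p
    term = math.exp(-a)      # Poisson probability of r = 0
    cumulative = term
    k = 0
    while cumulative < level:
        k += 1
        term *= a / k        # a^k e^(-a) / k! from the preceding term
        cumulative += term
    return k

if __name__ == "__main__":
    n, p = 200, 0.05
    print("normal law estimate: ", trunk_lines_normal(n, p))
    print("Poisson law estimate:", trunk_lines_poisson(n, p))
```

The normal law figure is not an integer and would be rounded up in practice; the Poisson figure is the integer that a table of cumulative probabilities would give.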


74. Example—Counting Bacteria


To count the bacteria in a given culture the usual procedure is to dilute the substance in which they are contained until a single drop of the mixture will contain on the average so few bacteria that they can be counted by direct observation under the microscope. If we keep track of the proportions used in diluting and the size of the drops studied, a little simple arithmetic takes us back from the average count per drop to the count per unit volume in the original culture.

The point that interests us here is that there is a Poisson probability distribution for the number of bacteria observed in a single drop under the microscope. We let the individual bacteria represent trials with success defined as "contained in the drop under observation." Clearly, the probability of success on a single trial is extremely small, small enough for the Poisson approximation to be very accurate. The probability of being in the drop is apparently the same for each of the bacteria; and if we assume that the probability of success for a single bacterium is independent of the presence or absence of others in the drop, we have a Bernoullian sequence of trials, and Corollary 1 to Theorem III gives us the Poisson distribution.

These remarks furnish a test of the bacteriologist's experimental technique. The record of a large number of observations should show essentially a Poisson frequency distribution for the count on a single observation. Otherwise, the procedure used in mixing and selecting a drop would be open to question.


75. Exact Poisson Distributions

One of the most profitable uses of the Poisson probability function is for problems concerning the probability of occurrence of a given number of events of a certain type within a given time interval. For problems of this kind we can formulate some rather general hypotheses from which it will follow that the Poisson distribution is the actual distribution, not merely an approximation.


Suppose we are interested in the occurrence within a given time 
interval of events of a certain type (e.g., disintegrations of atoms 1n 
a given radioactive source or claims against an insurance company). 
Let us distinguish three different possibilities: 


(a) Exactly one event of the given type within the given time $t$. The probability of this we shall designate by $p(t)$.

(b) No events of the given type within the given time $t$. The probability of this we shall designate by $q(t)$.

(c) More than one event of the given type within the given time $t$. The probability of this we shall designate by $e(t)$.


Let us now make the following hypotheses, not necessarily provable in any particular application, but certainly eminently reasonable:


(1) The probabilities $p$, $q$, and $e$ depend (as indicated by the notation) only on the duration of the time interval, not on the time it begins.

(2) For disjoint time intervals the occurrence of (a), (b), or (c) in one interval is independent of the results in the other intervals.

(3) $q(0) = 1$, and $q$ has a continuous derivative with $q'(0) = -a < 0$. Note that intuitively this says that in time zero we expect no events and as $t$ increases, the probability of no events decreases.

(4) $e(t) = o(t)$ as $t \to 0$.


Theorem IV. Subject to the hypotheses (1), (2), (3), and (4) above, the probability of exactly $r$ events within a time interval of length $t$ is

$$P(r) = \frac{(at)^r e^{-at}}{r!}.$$

Let us divide the given time interval into $n$ disjoint subintervals each of length $t/n$. The probability that $r$ of these subintervals contain one event each and the others none follows [because of hypotheses (1) and (2)] the Bernoulli formula. The only other way to get $r$ events is to have more than one in some subinterval. The probability of this latter possibility is, by hypothesis (4), $o(1/n)$ for each subinterval, therefore less than or equal to $n \cdot o(1/n) = o(1)$ for at least one. Thus,

$$P(r) = {}^{n}C_{r}\left[p\!\left(\frac{t}{n}\right)\right]^{r}\left[q\!\left(\frac{t}{n}\right)\right]^{n-r} + o(1).$$

Now all we have to do is note that by hypothesis (3) and Prob. 21, Chap. 7,

$$q\!\left(\frac{t}{n}\right) = 1 - \frac{at}{n} + o\!\left(\frac{1}{n}\right);$$

therefore

$$p\!\left(\frac{t}{n}\right) = 1 - q\!\left(\frac{t}{n}\right) - e\!\left(\frac{t}{n}\right) = \frac{at}{n} + o\!\left(\frac{1}{n}\right),$$

and we have

$$P(r) = {}^{n}C_{r}\left[\frac{at}{n} + o\!\left(\frac{1}{n}\right)\right]^{r}\left[1 - \frac{at}{n} + o\!\left(\frac{1}{n}\right)\right]^{n-r} + o(1).$$

With the unspecified $o$ functions properly chosen, this holds for each $n$; thus it holds for the limit as $n \to \infty$, and the desired result follows from Theorem II.
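The subdivision used in this proof can be imitated by a small simulation. The Python sketch below is a rough illustration under stated assumptions: each of `n_sub` subintervals is taken to contain exactly one event with probability $at/n$ and none otherwise (the $o(1/n)$ possibilities are ignored), and the names, the seed, and the sample values $a = .5$, $t = 10$ are hypothetical.

```python
import math
import random

def simulate_counts(rate, t, n_sub, n_runs, seed=0):
    """Number of events in (0, t) for each of n_runs simulated intervals.

    Each interval is cut into n_sub subintervals; a subinterval holds one
    event with probability rate * t / n_sub, independently of the others.
    """
    rng = random.Random(seed)
    p_sub = rate * t / n_sub
    counts = []
    for _ in range(n_runs):
        counts.append(sum(1 for _ in range(n_sub) if rng.random() < p_sub))
    return counts

def poisson(r, a):
    """The Poisson probability function a^r e^(-a) / r!."""
    return a ** r * math.exp(-a) / math.factorial(r)

if __name__ == "__main__":
    rate, t = 0.5, 10.0
    counts = simulate_counts(rate, t, n_sub=500, n_runs=2000)
    for r in range(11):
        observed = counts.count(r) / len(counts)
        print(r, round(observed, 4), round(poisson(r, rate * t), 4))
```

The observed frequencies should lie close to the Poisson column, with discrepancies of the size expected from 2,000 trials.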


76. Example—Radioactive Disintegrations

The emission of $\alpha$ particles by a radioactive source furnishes a good illustration of the uses to which Theorem IV can be put. The disintegration of a single atom of radium (for instance) is accompanied by the emission of a single $\alpha$ particle. This process seems to take place purely by chance so that we cannot say that under given conditions a given radioactive source will emit a certain number of $\alpha$ particles in a given length of time. However, we can say that there is a probability distribution for each of the various possibilities (no emissions, one emission, more than one), and the hypotheses (1), (2), (3), and (4) of Sec. 75 can be advanced as reasonable assumptions in connection with these probabilities. Therefore, by Theorem IV, there is a Poisson probability distribution for the number of $\alpha$ particles emitted by a given source within a specified time interval.

By means of an instrument called the Geiger counter, the physicist can actually count individual $\alpha$ particles. If the source strength is adjusted so that the average number of $\alpha$ particles entering the counter in a specified time interval (say 10 seconds) is something easy to count (say 5 or 6), then by reading the dial on the counter at the specified intervals the experimenter can get a large number of readings on a quantity (the number of hits) which has a Poisson probability distribution. It follows that the frequency distribution for the various readings should be essentially that of the Poisson formula.

It turns out that Geiger counter experiments actually do give such results, and this can be taken as experimental evidence in favor of the assumption that the hypotheses (1), (2), (3), and (4) of Sec. 75 are applicable to the problem.


77. Two Views of the Same Problem


We have presented the problem on telephone trunk lines as a
Bernoullian sequence with a Poisson approximation and that of the α
particles as a strictly Poisson case following the hypotheses in Sec. 75.
Actually, we can look at each of these problems in the other way.

To fit the telephone problem to hypotheses (1), (2), (3), and (4) of
Sec. 75, we consider not the number of telephones in use at a given
instant but the number of telephones beginning a call during a given
time interval. Looked at in this way, the telephone problem seems to
fit these hypotheses very well; and if we take as the time interval under
consideration the average length of a call, we get essentially the same
information.

To see the α particles as a Bernoullian sequence, we let a single atom
represent a trial and define success as disintegration during the specified
time interval. Then the assumptions that the probability of
disintegration is the same for each atom and that the atoms act independently
give us a Bernoullian sequence with very small p. So the answer is the
same as before.

Now, one of these analyses gives the Poisson formula as an
approximate answer, while the other gives it as the exact answer. This means
that the two sets of hypotheses are not completely equivalent, and
who is to say which set is the correct one? Experimental evidence
indicates something like Poisson distributions for these and many other
phenomena, but it can never be proved experimentally that the Poisson
either is or is not the exact distribution in any case. The discrepancy
between approximate and exact solutions is certainly within the range
of experimental error and chance fluctuation in statistical data.

Essentially, it boils down to a question of which explanation we
prefer, and the present-day tendency is to use something on the order
of the argument in Sec. 75 whenever possible.
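
The closeness of the two answers can be checked numerically. The following is only a sketch of ours, not part of the original argument: the function names and the choice a = 5 are assumptions made for illustration. It compares the Bernoullian probabilities for n trials with p = a/n against the Poisson probabilities with the same mean, and reports the largest gap over r = 0, 1, . . . , 20.

```python
from math import comb, exp, factorial


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n Bernoullian trials."""
    if r > n:
        return 0.0
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def poisson_pmf(r, a):
    """Poisson probability of the value r when the mean is a."""
    return exp(-a) * a**r / factorial(r)


def max_gap(n, a, r_max=20):
    """Largest difference between the two probability functions for r <= r_max."""
    p = a / n  # very small p when n is large, as in the atom-by-atom view
    return max(
        abs(binomial_pmf(r, n, p) - poisson_pmf(r, a)) for r in range(r_max + 1)
    )


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, max_gap(n, a=5.0))
```

For a fixed mean the gap shrinks as n grows, which is why no experiment with a very large number of atoms can be expected to separate the two sets of hypotheses.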


REFERENCES FOR FURTHER STUDY
Fry, Probability and Its Engineering Uses, New York (1928).


PROBLEMS
1. Let a certain volume be divided into n equal subdivisions.
Suppose there are k particles distributed in this volume in such a way
that for each particle the probability of being in a given subdivision is
the same for all subdivisions. Suppose, further, that the probabilities
for one particle are independent of the positions of the others. Show
that for n and k both large and of the same order of magnitude the
number of particles per subdivision has essentially a Poisson distribution.

2. Formulate a set of hypotheses modeled after those in Sec. 75 but 
phrased in terms of particles distributed in space as in Prob. 1. 

3. Fit the problem of counting bacteria into the model of Prob. 1;
to that of Prob. 2.

4. There are many examples that parallel the problem of counting
bacteria almost exactly: counting white blood corpuscles, counting
yeast spores in suspension, counting grit particles in lubricating oil, etc.
Add some more examples to this list.

5. If a map of London is divided into equal squares and the locations
of hits by "buzz bombs" during 1944 recorded on the map, show that
there should be something like a Poisson distribution for the number of
hits per square.

Problems 6 to 12 list a number of phenomena which might reasonably
be expected to have Poisson distributions. In each case formulate the
problem specifically as a sequence of Bernoullian trials and as a
situation satisfying hypotheses similar to those in Sec. 75. Criticize each
formulation, giving your opinion (with reasons) as to the correctness of
the assumptions made.


6. The number of nuts in a chocolate almond bar.
7. The number of typographical errors per page in a newspaper.
8. The number of times per day your telephone rings.
9. The number of calls per day answered by the fire department in
a large city.
10. The number of tornadoes hitting within a county in a certain
number of years for the counties in the "tornado belt" of the Middle
West.
11. The number of meteorites that strike the earth in a given period
of time.
12. The number of sets of twins born in a given hospital in a given
period.
13. Find other examples of natural phenomena that should give
Poisson distributions.
14. In Prob. 6, Chap. 9, we took 1 ball from each urn and found a
limiting normal distribution for the number of white balls. Suppose
we take a sequence of sequences of trials consisting of 1 draw from the
first urn, 2 independent draws from the second, . . . , k independent
draws from the kth, . . . . Show that as k → ∞ the probability
function for the number of white balls on the kth sequence tends to a
Poisson probability function.
15. In statistical mechanics each "microscopic system" (atom or
other fundamental unit) may be in any one of a large number of
"states" (e.g., energy levels). In the so-called classical statistics it is
assumed that all states are equally likely and that the state of one
system is independent of those of the others. Show that according
to these assumptions the number of microscopic systems in a
"macroscopic system" (large collection of microscopic ones) having a given
state should follow a Poisson distribution.

16. In the matching problem (see Sec. 33) we found that the
probability of exactly r correspondences was

$$\frac{1}{r!} \sum_{k=0}^{n-r} \frac{(-1)^k}{k!}.$$

Show that, as n → ∞, this tends to a Poisson probability function.


17. A deck of cards is shuffled; then the cards are turned up one ere 
time, each card being discarded after it is turned. The player er 
the cards of the deck in order (AS, KS, . . . ,28, AH, .. . ,2C); a 
ing a card each time he turns one. Show that there is a Poisson pr? 
bility distribution for the nu 

18. In the draft lottery i 
drawn from a fish 


Poisson function so rapid] 
made between Probs. 17 
being matched is 52 in one case and 10,000 in the other.
20. Find the values of the Poisson probabilities for a = 3 and r = 0,
1, 2, . . . , 10. (Use Molina's tables, if available.) Draw a graph.
21. What is the probability of 100 successes in 10,000 Bernoullian
trials with p = .01? Hint: Use Stirling's formula. Ans. .04.
22. Find the characteristic function for the Poisson distribution.
Ans. $\exp[a(e^{it} - 1)]$.
23. Find the characteristic function for the number of successes in a
single Bernoullian trial. Ans. $q + pe^{it}$.
24. Using the results of Probs. 22 and 23, prove Theorem I by the
method of characteristic functions.
25. Prove Theorem II.
26. Using hypotheses (1) and (2), Sec. 75, and considering the
subdivision of the time t into n equal subintervals, show that
$q(t) = [q(t/n)]^n$.
27. From Prob. 26, hypothesis (3), Sec. 75, and Prob. 26, Chap. 7,
show that $q(t) = e^{-at}$.


CHAPTER 11 
THE LAWS OF LARGE NUMBERS 


The laws of large numbers (particularly the weak law) hold in
a wide variety of cases. We shall not attempt to give a separate
description of all of these. Instead, we shall prove one or two
theorems concerning the validity of each law and then turn to some
examples.


78. Proof of the Weak Law 


Many interesting cases of the weak law of large numbers can be
derived from the following rather trivial lemma.


Lemma 1. If {y_n} is a sequence of stochastic variables with

$$\lim_{n \to \infty} \mathrm{var}(y_n) = 0,$$

then, for every fixed ε > 0,

$$\lim_{n \to \infty} \mathrm{pr}\{|y_n - \bar{y}_n| > \epsilon\} = 0.$$

This follows immediately from Tshebysheff's inequality (Theorem
VI, Chap. 6). Let $\sigma_n^2 = \mathrm{var}(y_n)$; then

$$\mathrm{pr}\{|y_n - \bar{y}_n| > t\sigma_n\} \le \frac{1}{t^2}.$$

Setting $\epsilon = t\sigma_n$, we have

$$\mathrm{pr}\{|y_n - \bar{y}_n| > \epsilon\} \le \frac{\sigma_n^2}{\epsilon^2},$$

and for fixed ε, the right-hand side tends to zero by hypothesis.

Turning to the law of large numbers itself, let us take a sequence
{x_k} of stochastic variables and let

$$B_n = \mathrm{var}\Big(\sum_{k=1}^{n} x_k\Big).$$

One of the simplest proofs of the weak law depends on the behavior of
B_n and nothing more. Let us begin with that.



Theorem I. If $B_n = o(n^2)$, the weak law of large numbers holds.


This follows immediately from Lemma 1. For the y_n of the lemma,
we use

$$M_n = \frac{1}{n} \sum_{k=1}^{n} (x_k - \bar{x}_k).$$

From (c) of Theorem IV, Chap. 6, we have

$$\mathrm{var}(M_n) = \frac{B_n}{n^2};$$

so our present hypotheses imply those of the lemma, and the result on
applying Lemma 1 is exactly the weak law of large numbers (see
Definition 3, Chap. 8).
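
For a Bernoullian sequence the bound used in Lemma 1 can be compared with the exact probability. The sketch below is our own illustration under stated assumptions (p = 1/2, ε = 1/10, and the function names are hypothetical). It computes pr{|r/n − p| > ε} exactly from the binomial probabilities and sets it beside the Tshebysheff bound var(r/n)/ε² = pq/(nε²). Exact fractions are used in the comparison |r/n − p| > ε so that boundary cases are not disturbed by rounding.

```python
from fractions import Fraction
from math import comb


def exact_tail(n, p, eps):
    """Exact pr{|r/n - p| > eps} for r successes in n Bernoullian trials."""
    pf = float(p)
    return sum(
        comb(n, r) * pf**r * (1 - pf) ** (n - r)
        for r in range(n + 1)
        if abs(Fraction(r, n) - p) > eps
    )


def tshebysheff_bound(n, p, eps):
    """The bound var(r/n) / eps^2 = pq / (n eps^2)."""
    return float(p * (1 - p) / (n * eps**2))


if __name__ == "__main__":
    p, eps = Fraction(1, 2), Fraction(1, 10)
    for n in (10, 100, 1000):
        print(n, exact_tail(n, p, eps), tshebysheff_bound(n, p, eps))
```

Both columns tend to zero as n increases, which is all the weak law asserts; the bound is crude, but it is enough for the proof.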

The student should note that Theorem I can be applied to dependent
as well as independent variables. In dependent cases the behavior of
B_n may be difficult to determine; but if this can be done and
$B_n/n^2 \to 0$, then the weak law applies. Let us turn now to some
corollaries in which conditions for the weak law are stated in terms of
the individual variables x_k instead of in terms of the variance of the
sum. We shall designate var(x_k) by b_k. From Corollary 2 to Theorem V,
Chap. 6, we see that, if the variables x_k are uncorrelated by pairs,

$$B_n = \sum_{k=1}^{n} b_k.$$

In these cases we can give a number of useful characterizations
of the conditions under which the weak law holds.


Corollary 1. If the variables x_k are uncorrelated by pairs, each of
the following conditions is sufficient for the weak law of large numbers:

(a) [condition illegible in the source]
(b) The variances b_k are uniformly bounded.
(c) The variables x_k are uniformly bounded.
(d) The variables x_k are identical and have second moments.


These results all follow immediately from Theorem I. If the
variables are independent, we can use the method of characteristic
functions to show that in the case of identical variables the existence of
first moments is sufficient for the weak law to hold. (See Prob. 14 at
the end of this chapter.)




79. Convergence in Probability 

The weak law of large numbers involves a special case of what is
known as convergence in probability. If y_1, y_2, y_3, . . . is a sequence
of stochastic variables and a is a constant, we say that y_n converges in
probability to a provided that, for each ε > 0,

$$\lim_{n \to \infty} \mathrm{pr}\{|y_n - a| < \epsilon\} = 1.$$

Clearly, in this terminology the weak law of large numbers says that
M_n converges in probability to zero.

Convergence in probability has many of the formal properties of 
ordinary convergence of a Sequence of numbers or functions, but these 
do not follow just because we have used the word “ convergence.” We 
must prove them. The student should note that each of the proofs 
below begins just like the proof in any calculus book of the correspond- 
ing property of ordinary convergence of sequences, but slightly differ- 
ent considerations are necessary to finish the proof in each case. 


Theorem II. If y_n converges in probability to a and z_n converges
in probability to b, then

(a) y_n + z_n converges in probability to a + b.
(b) y_n z_n converges in probability to ab.
(c) y_n/z_n converges in probability to a/b provided that b ≠ 0.


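To prove (a), we note that

$$(y_n + z_n) - (a + b) = (y_n - a) + (z_n - b);$$

so

$$|(y_n + z_n) - (a + b)| \le |y_n - a| + |z_n - b|.$$

Now, by the usual argument,

$$|(y_n + z_n) - (a + b)| > \epsilon$$

implies that

$$|y_n - a| > \frac{\epsilon}{2} \quad \text{or} \quad |z_n - b| > \frac{\epsilon}{2}$$

(or both). Therefore, by (d) of Theorem I, Chap. 2,

$$\mathrm{pr}\{|(y_n + z_n) - (a + b)| > \epsilon\} \le \mathrm{pr}\Big\{|y_n - a| > \frac{\epsilon}{2} \text{ or } |z_n - b| > \frac{\epsilon}{2}\Big\}$$
$$\le \mathrm{pr}\Big\{|y_n - a| > \frac{\epsilon}{2}\Big\} + \mathrm{pr}\Big\{|z_n - b| > \frac{\epsilon}{2}\Big\}.$$

The last inequality is obtained from (e) of Theorem I, Chap. 2. Now,
each of these last two probabilities tends to zero; therefore the first
one does, and the theorem is proved.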

The proofs of (b) and (c) proceed along exactly the same lines once
we get the necessary inequalities set up. For the products we start
with the identity

$$y_n z_n - ab = (y_n - a)(z_n - b) + a(z_n - b) + b(y_n - a).$$

From this it follows that

$$|y_n z_n - ab| \le |y_n - a|\,|z_n - b| + |a|\,|z_n - b| + |b|\,|y_n - a|,$$

and again the probability that the left-hand side is larger than ε may
be shown to be less than or equal to the sum of probabilities each of
which tends to zero.

The identity we need for (c) is

$$\frac{y_n}{z_n} - \frac{a}{b} = \frac{(y_n - a)(b - z_n) + a(b - z_n)}{b^2 + b(z_n - b)} + \frac{y_n - a}{b}.$$

From this it follows that, for $|z_n - b| < |b|$,

$$\Big|\frac{y_n}{z_n} - \frac{a}{b}\Big| \le \frac{|y_n - a|\,|b - z_n| + |a|\,|b - z_n|}{b^2 - |b|\,|z_n - b|} + \frac{|y_n - a|}{|b|}.$$

The left-hand side may be too large because |z_n − b| is so large in
comparison with |b| that the first fraction has a small denominator; or it
may be too large because one of the numerators on the right is large.
Each of these possibilities has probability tending to zero, and the
result follows in the usual manner.


80. Applications to Sampling

The ideas in this chapter find a great number of applications in
the theory of sampling. If the student wants a thorough treatment of
this subject, we must refer him to a treatise on statistics. In this
section we shall indicate only a few results that follow more or less
easily from the theorems we have proved.
Let us recall the picture (see Sec. 64) of a sequence of independent
random samples from some population. First, let us organize the
terminology and notation we want to use.




The population has:


A distribution function F(x), equal to the probability that the
measurement being studied is less than or equal to x

A probability (or density) function f(x)

A mean $\bar{x}$

A variance $\sigma^2$

Other moments in case we need them


The individual samples are independent stochastic variables x_k, each
having the distribution function F(x) of the population. Therefore,
each x_k has mean $\bar{x}$ and variance $\sigma^2$.

The sample aggregate is the sequence of stochastic variables x_1, x_2,
x_3, . . . , x_n. In terms of these variables we define the following:


The sample mean: $x^* = \frac{1}{n} \sum_{k=1}^{n} x_k$.

The sample variance: $s^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - x^*)^2$.

The sample distribution: F*(x). This is 1/n times the number of
values of k for which $x_k \le x$.

The sample relative frequency: f*(x). In case the samples are discrete
variables we let f*(x) be 1/n times the number of values of k for which
x_k = x. In the continuous case we subdivide the range of sample
values into disjoint intervals and define f*(x) at the mid-point of each
of these as 1/n times the number of values of k for which x_k is in the
subinterval.


At this point we should pause to emphasize that $x^*$ and $s^2$ are
stochastic variables in contrast to $\bar{x}$ and $\sigma^2$ which are (unknown)
constants. Furthermore, for each x, F*(x) is a stochastic variable, as
is f*(x), while F(x) and f(x) are fixed functions of x.

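First, let us compute a few moments:


Theorem III. $E(x^*) = \bar{x}$.


This follows immediately from the fact that $E(x_k) = \bar{x}$ for each k;
so

$$E(x^*) = \frac{1}{n} \sum_{k=1}^{n} E(x_k) = \frac{1}{n}(n\bar{x}) = \bar{x}.$$


Theorem IV. $\mathrm{var}(x^*) = \sigma^2/n$.


By Theorem V, Chap. 6,

$$\mathrm{var}(x^*) = \frac{1}{n^2} \sum_{k=1}^{n} \mathrm{var}(x_k) = \frac{\sigma^2}{n}.$$


Theorem V. $E(s^2) = \frac{n-1}{n}\,\sigma^2$.


First, we note that

$$s^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k^2 - 2x^* x_k + x^{*2}) = \frac{1}{n} \sum_{k=1}^{n} x_k^2 - x^{*2}.$$

The result now follows as soon as we note that

$$E\Big(\frac{1}{n} \sum_{k=1}^{n} x_k^2\Big) = \frac{1}{n} \sum_{k=1}^{n} E(x_k^2) = \frac{1}{n} \sum_{k=1}^{n} (\sigma^2 + \bar{x}^2) = \sigma^2 + \bar{x}^2,$$

and

$$E(x^{*2}) = \mathrm{var}(x^*) + [E(x^*)]^2 = \frac{\sigma^2}{n} + \bar{x}^2.$$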
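
Theorem V can be confirmed by brute force on a small example. The following sketch is ours, not the author's: the four-element population and the function names are assumptions chosen for illustration. It draws every equally likely ordered sample of size n (with replacement) from a finite population, averages s² over all of them in exact rational arithmetic, and compares the result with (n − 1)σ²/n.

```python
from fractions import Fraction
from itertools import product


def mean_and_variance(values):
    """Mean and variance with divisor n, as in the definitions of x* and s^2."""
    n = len(values)
    mean = sum(values, Fraction(0)) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


def expected_sample_variance(population, n):
    """Exact E(s^2) over all ordered samples of size n drawn with replacement."""
    total = Fraction(0)
    count = 0
    for sample in product(population, repeat=n):
        total += mean_and_variance(list(sample))[1]
        count += 1
    return total / count


if __name__ == "__main__":
    population = [0, 1, 1, 4]  # hypothetical population, each value equally likely
    _, sigma2 = mean_and_variance(population)
    for n in (2, 3, 4):
        print(n, expected_sample_variance(population, n), Fraction(n - 1, n) * sigma2)
```

The two printed fractions agree for each n, and the first is always smaller than σ², which is the bias discussed later in this section.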


As a direct application of (d) of Corollary 1 to Theorem I we have the
following:


Theorem VI. $x^*$ converges in probability to $\bar{x}$.


From this and (b) of Theorem II it follows that $x^{*2}$ converges in
probability to $\bar{x}^2$. By applying (d) of Corollary 1 to Theorem I to the
sequence $\{x_k^2\}$, we see that $\sum x_k^2/n$ converges in probability to
$\sigma^2 + \bar{x}^2$. Combining these results by (a) of Theorem II, we have
the following:


Theorem VII. $s^2$ converges in probability to $\sigma^2$.


It would appear from this last result that $s^2$ is a reasonable estimate
of $\sigma^2$. It is, in the sense that for large n a single value of $s^2$ has a large
probability of being close to $\sigma^2$. However, if we take a fixed value of n
(large or small) and take a sequence of m sample aggregates, each
consisting of n samples, and compute $s^2$ for each one, then we can apply
Corollary 1 to Theorem I to the sequence of sample variances to
find that as m → ∞ the arithmetic mean of these converges in
probability not to $\sigma^2$ but to $(n - 1)\sigma^2/n$. For large n, this discrepancy is
not worth worrying about, but in general we can make our estimate
of $\sigma^2$ more precise (in the sense just noted) by using, instead of $s^2$,

$$s'^2 = \frac{n s^2}{n - 1} = \frac{1}{n - 1} \sum_{k=1}^{n} (x_k - x^*)^2.$$


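To return to the discussion of sample means in Sec. 64, let us take $s'$
as an estimate of $\sigma$. Then by the central limit theorem,

$$(1) \qquad \mathrm{pr}\Big\{|x^* - \bar{x}| < \frac{t s'}{\sqrt{n}}\Big\} \sim \frac{1}{\sqrt{2\pi}} \int_{-t}^{t} e^{-z^2/2}\, dz.$$

A quick look at a table of the normal distribution now gives us the
results most commonly associated with $s'$:

$$(2) \qquad \mathrm{pr}\Big\{|x^* - \bar{x}| < \frac{.675\, s'}{\sqrt{n}}\Big\} \sim \frac{1}{2},$$

$$(3) \qquad \mathrm{pr}\Big\{|x^* - \bar{x}| < \frac{s'}{\sqrt{n}}\Big\} \sim \frac{2}{3}.$$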
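
The table look-up can be reproduced with the error function. This is a sketch under two assumptions of ours: that the normal integral is taken over a symmetric range (−t, t), and that the multipliers of interest are .675 and 1 (the second is inferred from the probability of roughly 2/3). The function name is hypothetical; `math.erf` is part of the Python standard library.

```python
from math import erf, sqrt


def central_normal_probability(t):
    """(1/sqrt(2 pi)) times the integral of exp(-z^2/2) over -t < z < t."""
    return erf(t / sqrt(2))


if __name__ == "__main__":
    for t in (0.675, 1.0):
        print(t, central_normal_probability(t))
```

The first value is very nearly 1/2, and the second is about .68, which is commonly rounded to 2/3.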


Finally, let us note that, for each k,

$$\mathrm{pr}\{x_k \le x\} = F(x);$$

so the variable F*(x) is, for each x, just 1/n times the number of
successes in n Bernoullian trials with p = F(x). We have seen in any
number of instances that for a Bernoullian sequence E(r/n) = p; so
E(F*) = F, and the weak law of large numbers applies (the Bernoullian
case can be made to fit any of the conditions listed in Corollary 1 to
Theorem I) to tell us as follows:


Theorem VIII. For each x, F*(x) converges in probability to F(x).


In the discrete case (possible values of the samples form a discrete
set) the same argument tells us that, for each x, f*(x) converges in
probability to f(x). In the continuous case it is a little more
complicated. To have the sample relative frequency converge to the density
function, we must increase the number of intervals as we increase the
number of samples. A theorem of this sort can be obtained, but we
shall not go into the details here.
81. Example—A Numerical Problem 
By way of illustration of the ideas suggested in the last section,
let us take a set of data and compute some of the quantities we have
been talking about. The data in the accompanying table are just some
we made up, but they will serve as an example. We have assumed that
the problem was one in which the sample variables automatically have
integer values. Perhaps they are the count on something, e.g., the
number of defective cartridges in a case or the number of bad potatoes
in a bushel. In our table the column labeled x gives the values of the
sample variables; r is the number of samples producing the indicated
value. The sum of the r column gives us n, the total number of
samples, and the next column, labeled r/n, gives what we have called the
sample relative frequency. The other columns are for convenience in
computation. Note that the sum of the rx column gives $\sum x_k$, and the
sum of the rx² column gives $\sum x_k^2$.


TABLE OF HYPOTHETICAL SAMPLE DATA


    x        r      r/n       rx       rx²
    0        3     .003        0         0
    1       15     .015       15        15
    2       44     .044       88       176
    3       89     .089      267       801
    4      133     .133      532     2,128
    5      160     .160      800     4,000
    6      162     .162      972     5,832
    7      137     .137      959     6,713
    8      102     .102      816     6,528
    9       69     .069      621     5,589
   10       41     .041      410     4,100
   11       25     .025      275     3,025
   12       12     .012      144     1,728
   13        5     .005       65       845
   14        2     .002       28       392
   15        1     .001       15       225
  Total  1,000    1.000    6,007    42,097


From this table it appears that

$$x^* = \frac{6{,}007}{1{,}000} = 6.007,$$

$$s^2 = \frac{1}{1{,}000}\Big[42{,}097 - \frac{(6{,}007)^2}{1{,}000}\Big] = 6.013,$$

$$s'^2 = \frac{1}{999}\Big[42{,}097 - \frac{(6{,}007)^2}{1{,}000}\Big] = 6.019,$$

$$\frac{s'}{\sqrt{n}} = \sqrt{.006019} = 0.0776.$$

On the basis of this information we can say that the population mean
$\bar{x}$ should be approximately 6. In fact, from (3), Sec. 80, it follows that,
if we estimate

$$5.929 < \bar{x} < 6.085,$$

there is a probability of approximately 2/3 that we are correct.
We should like to call special attention to the way in which we
have stated the significance of (3), Sec. 80, in this problem. Throughout
the discussion in the preceding section we regarded $x^*$ as the stochastic
variable, not $\bar{x}$. Therefore, it would be improper to interpret (3), Sec.
80, as saying that

$$\mathrm{pr}\{5.929 < \bar{x} < 6.085\} \sim \tfrac{2}{3}.$$

If $\bar{x}$ has a probability distribution, we have not made any study of it;
so we should not interpret our results as probabilities concerning $\bar{x}$.
Finally, it follows from Theorem VIII that for each value of x there is 
a reasonably large probability that the value of r/n listed in the table 
is close to the value of the population probability function f(x). 
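
The arithmetic of this example is easy to repeat by machine. The sketch below is ours (the names `DATA` and `sample_statistics` are hypothetical); it takes only the x and r columns of the table and recomputes the column sums, the sample mean, s², s'², and s'/√n from the definitions in Sec. 80.

```python
from math import sqrt

# (x, r) pairs: value of the sample variable and number of samples giving it
DATA = [
    (0, 3), (1, 15), (2, 44), (3, 89), (4, 133), (5, 160), (6, 162), (7, 137),
    (8, 102), (9, 69), (10, 41), (11, 25), (12, 12), (13, 5), (14, 2), (15, 1),
]


def sample_statistics(data):
    """Column sums and the derived sample quantities for grouped integer data."""
    n = sum(r for _, r in data)
    sum_x = sum(r * x for x, r in data)        # sum of the rx column
    sum_x2 = sum(r * x * x for x, r in data)   # sum of the rx^2 column
    mean = sum_x / n
    s2 = (sum_x2 - sum_x**2 / n) / n           # divisor n
    s_prime2 = n * s2 / (n - 1)                # divisor n - 1
    return {
        "n": n,
        "sum_x": sum_x,
        "sum_x2": sum_x2,
        "mean": mean,
        "s2": s2,
        "s_prime2": s_prime2,
        "std_error": sqrt(s_prime2 / n),
    }


if __name__ == "__main__":
    for name, value in sample_statistics(DATA).items():
        print(name, value)
```

Working from the raw counts rather than from the printed column totals gives an independent check on every figure in the computation.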


82. The a Posteriori Approach 


The weak law of large numbers applied to a Bernoullian sequence of
trials gives us the information that

$$\lim_{n \to \infty} \mathrm{pr}\Big\{\Big|\frac{r}{n} - p\Big| < \epsilon\Big\} = 1.$$

The direct physical interpretation of this is that, if p is known because
of the nature of the experiment, the law of large numbers gives us a
prediction as to the nature of the results. It is quite natural, however,
to use it the other way. Given the result of a large number of trials
we might reasonably say that the law of large numbers lends credence to
the assumption that the (presumably unknown) value of p is close to
r/n.

However, the direct attack on this latter problem would seem to be
through Bayes' theorem, (1) of Sec. 21. After all, the value of p is the
hypothesis, and the experimental ratio r/n is the result; so any
probability for p on the basis of a knowledge of r/n would seem to be an a
posteriori probability. If we try to solve the problem in this manner,
we meet with the difficulty always encountered in the use of Bayes'
theorem; viz., to find the a posteriori probability of a certain hypothesis,
we must have a priori probabilities for all possible hypotheses. In this
particular case, we want the conditional probability that

$$a \le p \le b,$$

given that there were r successes in n trials. To use Bayes' theorem,
we need a set of a priori probabilities for all values of p between zero
and one.

For the sake of getting a problem set up, let us assume that, a priori,
p is uniformly distributed on the unit interval. That is, for
$0 \le a \le b \le 1$,

$$\mathrm{pr}\{a \le p \le b\} = b - a.$$

There is no really convincing justification for this assumption, but it
gives us something to work with. From it we can compute the
conditional probabilities,

$$\mathrm{pr}_{r,n}\{a \le p \le b\},$$

given that there were r successes in n trials. Not only that, but we
can prove the following:


Theorem IX. If $\mathrm{pr}\{a \le p \le b\} = b - a$, then

$$\lim_{n \to \infty} \mathrm{pr}_{r,n}\Big\{\Big|p - \frac{r}{n}\Big| < \epsilon\Big\} = 1$$

for every constant ε > 0.


Let us divide the unit interval up into k equal subdivisions. Then
the events B_i of (1), Sec. 21, are the events "p lies in the ith
subdivision," which we shall approximate by p = i/k. Thus

$$\mathrm{pr}\{B_i\} = \frac{1}{k}.$$

Now, A will be the event "r successes in n trials." So, by Theorem I,
Chap. 4,

$$\mathrm{pr}_{B_i}\{A\} = \binom{n}{r} \Big(\frac{i}{k}\Big)^r \Big(1 - \frac{i}{k}\Big)^{n-r}.$$

Hence, by (1), Sec. 21,

$$\mathrm{pr}_{r,n}\{a \le p \le b\} = \frac{\sum_{a \le i/k \le b} (i/k)^r (1 - i/k)^{n-r}}{\sum_{i=1}^{k} (i/k)^r (1 - i/k)^{n-r}},$$

subject to the approximations p = i/k used to represent the events B_i.
These approximations become more accurate as k → ∞; and as this
happens, the two sums above tend to definite integrals. Therefore,

$$\mathrm{pr}_{r,n}\{a \le p \le b\} = \frac{\int_a^b x^r (1 - x)^{n-r}\, dx}{\int_0^1 x^r (1 - x)^{n-r}\, dx} = \frac{1}{B(r + 1,\, n - r + 1)} \int_a^b x^r (1 - x)^{n-r}\, dx.$$

In other words, the conditional probabilities for the values of p are
represented by a stochastic variable whose density function is

$$\frac{x^r (1 - x)^{n-r}}{B(r + 1,\, n - r + 1)}.$$

Now (Prob. 17, Chap. 7), this variable has mean

$$\frac{r + 1}{n + 2}$$

and variance

$$\frac{(r + 1)(n - r + 1)}{(n + 2)^2 (n + 3)}.$$

Since this latter is clearly $O(n^{-1})$, the result follows from Lemma 1.

Having manufactured a probability distribution for p, we are (in
contrast to the situation in the preceding section) entitled to make
statements of the form $\mathrm{pr}\{\alpha < p < \beta\} = \gamma$. Before we regard this as
a great forward step, however, we should take a closer look at the
interpretation that is to be placed on a statement of this sort. Stated in
words, Theorem IX says something to this effect: If a number p is
chosen at random between 0 and 1 and a Bernoullian sequence of
experiments devised having this number as the probability of success,
then, knowing that this experiment produced r successes in n trials, we
can say that the chances are that the number chosen was close to r/n.
It is the necessity for an a priori assumption that robs this
interpretation of most of its usefulness. Without such an assumption we cannot
make direct probability statements about p, but we can say (by direct
application of the weak law) that there is a large probability for the
correctness of the obvious estimate p ≈ r/n. This being the case, it
follows from Theorem VIII that, if we consistently make estimates of
this sort, there is a large probability that we shall be right most of the
time. Most authorities prefer this latter analysis of the problem.
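
The subdivision argument in the proof of Theorem IX can be carried out numerically. The sketch below is our own illustration, with hypothetical function names and an arbitrary choice of k = 2,000 subdivisions. It assumes the uniform a priori distribution, represents the event "p lies in the ith subdivision" by p = i/k, and applies Bayes' theorem directly; the binomial coefficient and the factor 1/k cancel from numerator and denominator and are omitted.

```python
def posterior_weights(r, n, k=2000):
    """Unnormalized a posteriori weights for p = i/k, i = 1, ..., k."""
    return [(i / k) ** r * (1 - i / k) ** (n - r) for i in range(1, k + 1)]


def posterior_probability(r, n, a, b, k=2000):
    """Approximate conditional probability that a <= p <= b given r successes in n."""
    weights = posterior_weights(r, n, k)
    inside = sum(w for i, w in enumerate(weights, start=1) if a <= i / k <= b)
    return inside / sum(weights)


def posterior_mean(r, n, k=2000):
    """Approximate mean of the a posteriori distribution of p."""
    weights = posterior_weights(r, n, k)
    return sum(w * i / k for i, w in enumerate(weights, start=1)) / sum(weights)


if __name__ == "__main__":
    print(posterior_mean(7, 10), 8 / 12)             # compare with (r + 1)/(n + 2)
    print(posterior_probability(7, 10, 0.6, 0.8))    # 10 trials
    print(posterior_probability(70, 100, 0.6, 0.8))  # 100 trials, same ratio r/n
```

With the ratio r/n held at .7, the probability assigned to the interval from .6 to .8 rises sharply between n = 10 and n = 100, which is the concentration that Theorem IX describes.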


83. Proof of the Strong Law 


The necessary and sufficient condition for the strong law of large
numbers to hold is as yet unknown. The best known theorem is that
of Kolmogoroff to the effect that the strong law holds provided that

$$\sum_{k=1}^{\infty} \frac{b_k}{k^2}$$

is convergent. We shall not attempt to prove this theorem here.
Instead, we shall impose a much stronger hypothesis which makes the
proof easier but is still satisfied in many practical cases.


Lemma 2. If a stochastic variable x has a fourth moment, it has a
second moment, and $E(x^2) \le [E(x^4)]^{1/2}$.


This follows from Schwartz's inequality (Theorem VIII, Chap. 6).
We replace x by x² and y by 1, and Schwartz's inequality says just what
we want it to.


Lemma 3. If x_1, x_2, x_3, . . . is a sequence of independent stochastic
variables with $\bar{x}_k = 0$ for each k, and if the fourth moments of the x_k's
are uniformly bounded, then

$$E\Big[\Big(\sum_{k=1}^{n} x_k\Big)^4\Big] = O(n^2).$$


If the fourth moments are bounded, it follows from Lemma 2 that
the second moments are too. Let K be the bound on all the second and
fourth moments. If we expand $(\sum x_k)^4$, we get terms of three types:


n terms of the form $x_k^4$
3n(n − 1) terms of the form $x_j^2 x_k^2$
Other terms in which at least one variable appears to the first power


Since each x_k has mean zero and the variables are independent, each
term in our third category has expectation zero. Hence,

$$E\Big[\Big(\sum_{k=1}^{n} x_k\Big)^4\Big] = \sum_{k=1}^{n} E(x_k^4) + 3 \sum_{j \ne k} E(x_j^2) E(x_k^2)$$
$$\le nK + 3n(n - 1)K^2$$
$$= O(n^2).$$
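
The count of terms used in the proof of Lemma 3 can be verified by listing them. The sketch below is ours (the function name and the category labels are hypothetical). It treats the expansion of (x_1 + · · · + x_n)⁴ as the n⁴ ordered products x_i x_j x_k x_l and sorts each product by the exponents with which the distinct variables appear.

```python
from collections import Counter
from itertools import product


def classify_terms(n):
    """Count the ordered products in (x_1 + ... + x_n)**4 by exponent pattern."""
    counts = Counter()
    for indices in product(range(n), repeat=4):
        exponents = sorted(Counter(indices).values())
        if exponents == [4]:
            counts["fourth power"] += 1
        elif exponents == [2, 2]:
            counts["two squares"] += 1
        else:
            # patterns [1, 3], [1, 1, 2], [1, 1, 1, 1]: some variable to the first power
            counts["has a first power"] += 1
    return counts


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, dict(classify_terms(n)), 3 * n * (n - 1))
```

Only the first two categories survive when expectations are taken, and together they contribute n + 3n(n − 1) terms, which is of order n² rather than n⁴.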


Theorem X. If x_1, x_2, x_3, . . . is a sequence of independent
stochastic variables with uniformly bounded fourth moments, then the
strong law of large numbers holds.


We apply Tshebysheff's inequality (Theorem VI, Chap. 6) to the
variable $M_n^2$ and get

$$\mathrm{pr}\{M_n^2 > t[E(M_n^4)]^{1/2}\} \le \frac{1}{t^2}.$$

Now, we assign t so that

$$t[E(M_n^4)]^{1/2} = n^{-1/4};$$

then

$$\mathrm{pr}\{|M_n| > n^{-1/8}\} \le n^{1/2} E(M_n^4).$$

By Lemma 3,

$$E(M_n^4) = n^{-4} \cdot O(n^2) = O(n^{-2});$$

therefore

$$\mathrm{pr}\{|M_n| > n^{-1/8}\} = n^{1/2} \cdot O(n^{-2}) = O(n^{-3/2}).$$

If M_n does not tend to zero, then for every k there is an n > k for
which $|M_n| > n^{-1/8}$. For each k, the probability of having such an
n > k is [by (e) of Theorem I, Chap. 2] less than or equal to

$$\sum_{n=k}^{\infty} \mathrm{pr}\{|M_n| > n^{-1/8}\} = \sum_{n=k}^{\infty} O(n^{-3/2}).$$

Therefore, the probability that M_n does not tend to zero is less than
or equal to

$$\sum_{n=k}^{\infty} O(n^{-3/2})$$

for every k. However, $\sum O(n^{-3/2})$ is a convergent series; so its
remainder tends to zero as k → ∞. Hence, the only nonnegative
number less than or equal to

$$\sum_{n=k}^{\infty} O(n^{-3/2})$$

for every k is zero itself, and the theorem is proved.
If the x_k's are identical and have fourth moments, then these
moments are uniformly bounded, since they are all the same. Thus we
have the following corollary, applicable to many practical cases:


Corollary 1. If the variables x_k are independent and identical and
have fourth moments, then the strong law of large numbers holds.
84. Example—Decimal Expansions 


Let us say that a number has a normal decimal expansion provided
that each integer 0, 1, 2, . . . , 9 appears, on the average, once in every
10 decimal places. To be more specific, let r_i (i = 0, 1, 2, . . . , 9) be
the number of times we find the integer i in the first n decimal places.
Then, the decimal expansion is normal if, for each i,

$$\lim_{n \to \infty} \frac{r_i}{n} = \frac{1}{10}.$$

The student probably considers the rational numbers the most
familiar ones. Now, a rational number has a repeating decimal
expansion; so the only rational numbers with normal decimal
expansions are those for which a cycle in the expansion contains each of the
digits 0, 1, 2, . . . , 9 the same number of times. The simplest of
these would be

$$\frac{123456789}{9999999999} = .01234567890123456789 \cdots.$$

However, the strong law of large numbers tells us that, if a number t
is chosen at random between 0 and 1, then the probability that its
decimal expansion is normal is equal to unity. To see this, we apply
the strong law ten times. For example, suppose we are interested in
7's. Let

$$x_k = \begin{cases} 1 & \text{if the kth decimal place is a 7,} \\ 0 & \text{otherwise.} \end{cases}$$

Clearly, the variables x_k are identical and have fourth moments. By
Prob. 33, Chap. 3, they are totally independent; so Corollary 1 to
Theorem X applies and tells us that

$$\mathrm{pr}\Big\{\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} x_k = \frac{1}{10}\Big\} = 1.$$

Now,

$$\sum_{k=1}^{n} x_k = r_7;$$

so the result is proved as far as 7's are concerned.
The way to extend this to all 10 digits simultaneously is to consider
the complementary events. We see from the above argument that

$$\mathrm{pr}\Big\{\lim_{n \to \infty} \frac{r_i}{n} \ne \frac{1}{10}\Big\} = 0$$

for each i (i = 0, 1, 2, . . . , 9). For different values of i, these events
are not mutually exclusive; but by (e) of Theorem I, Chap. 2, the
probability of failure for at least one i is less than or equal to the sum of
zeros. Therefore, the probability of a normal decimal expansion is 1.
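
The rational example above can be checked by ordinary long division. The following sketch is ours (the function names are hypothetical). It produces the first decimal places of a fraction between 0 and 1 with exact integer arithmetic and tabulates the relative frequency r_i/n of each digit.

```python
from collections import Counter


def decimal_digits(numerator, denominator, places):
    """First `places` decimal digits of numerator/denominator, 0 <= numerator < denominator."""
    digits = []
    remainder = numerator
    for _ in range(places):
        remainder *= 10
        digits.append(remainder // denominator)
        remainder %= denominator
    return digits


def digit_frequencies(digits):
    """Relative frequency of each of the digits 0, 1, ..., 9."""
    counts = Counter(digits)
    return [counts[d] / len(digits) for d in range(10)]


if __name__ == "__main__":
    digits = decimal_digits(123456789, 9999999999, 1000)
    print(digits[:20])
    print(digit_frequencies(digits))
    print(digit_frequencies(decimal_digits(1, 3, 1000)))  # far from normal
```

For the first fraction every frequency is exactly 1/10 whenever n is a multiple of 10, while for 1/3 the digit 3 has frequency 1 and every other digit has frequency 0.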


REFERENCES FOR FURTHER STUDY 


Cramér, Mathematical Methods of Statistics, Chaps. 20, 25, 27.
Uspensky, Introduction to Mathematical Probability, Chaps. VI, X, XI.


PROBLEMS 


1. Show that the strong (and therefore the weak) law of large
numbers holds for a Bernoullian sequence of trials.

2. Show that the strong law of large numbers holds for any Poisson
sequence of trials. (Compare Prob. 5, Chap. 9.) There are
Poisson sequences for which both laws of large numbers hold but the
central limit theorem does not.

3. Show that, as the number of degrees of freedom tends to ∞,
$\chi^2/n \to 1$ with probability 1. Hint: Show that Corollary 1 to
Theorem X applies to Prob. 10, Chap. 8.

4. Apply Tshebysheff's inequality (Theorem VI, Chap. 6) to a
Bernoullian sequence of trials to get an estimate of

$$\mathrm{pr}\Big\{\Big|\frac{r}{n} - p\Big| < \epsilon\Big\}.$$

Ans. This probability is greater than $1 - (pq/n\epsilon^2)$.
5. Show that, for x > 0,

$$\int_x^{\infty} e^{-z^2/2}\, dz < \frac{1}{x}\, e^{-x^2/2}.$$

Hint:

$$\int_x^{\infty} e^{-z^2/2}\, dz < \frac{1}{x} \int_x^{\infty} z e^{-z^2/2}\, dz,$$

and this last expression can be integrated to give the result.
6. Apply Prob. 5 to the estimate (1), Sec. 80, for the case of
Bernoullian trials to show that, for large n,

$$\mathrm{pr}\Big\{\Big|\frac{r}{n} - p\Big| > \epsilon\Big\} < \frac{2\sqrt{pq}}{\epsilon\sqrt{2\pi n}}\, e^{-\epsilon^2 n/2pq}.$$

7. Compare Probs. 4 and 6. Note that

$$x^{-1} e^{-x^2/2} = o(x^{-2})$$

as x → ∞. (Prove this by L'Hôpital's rule.) Therefore, for large n
the central limit theorem gives a closer estimate than Tshebysheff's
inequality.

8. Let x_1, x_2, . . . , x_n be independent random samples from some
population. Let $\alpha_m$ be the mth moment of the population about zero,
and let

$$a_m = \frac{1}{n} \sum_{k=1}^{n} x_k^m.$$

Assuming that $\alpha_m$ exists for every m, show that $a_m$ converges in
probability to $\alpha_m$.

9. Using Theorem II, show that any polynomial $P(a_1, a_2, \ldots, a_m)$
in the sample moments (see Prob. 8) converges in probability to the
corresponding combination of the population moments.

10. Extend Prob. 9 to rational functions of the sample moments,
subject to the provision that the denominator is different from zero
when the population moments are substituted.

11. Let x_1, x_2, . . . , x_n and y_1, y_2, . . . , y_n be independent sample
measurements of 2 different quantities from the same population.
(For example, x might be the height of a child, y his weight.) Show
that the sample correlation coefficient

$$\frac{\sum (x_k - x^*)(y_k - y^*)}{[\sum (x_k - x^*)^2\, \sum (y_k - y^*)^2]^{1/2}}$$

converges in probability to the population correlation coefficient for
these 2 quantities.

12. Find the characteristic function for the distribution in which
pr{x = a} = 1, pr{x ≠ a} = 0. Ans. $e^{ita}$.

13. Show that, if x has a finite first moment, the characteristic
function for x may be written $g(t) = 1 + it\bar{x} + o(t)$.

14. Using Probs. 12 and 13, prove Khintchine’s theorem: If the
variables $x_1, x_2, x_3, \ldots$ are independent and identical and have finite
first moments, then the weak law of large numbers holds.

15. Show that to prove Theorem VII by means of (d) of Corollary 1
to Theorem I we must assume that the population has a fourth moment,
while the same result can be obtained with no more than second
moments by using Prob. 14.

16. Show that the a posteriori probabilities of Theorem IX are equal
to certain a priori Bernoullian probabilities. Specifically, the
conditional probability (given r successes in n trials) that $p < (r/n) - \epsilon$ is
equal to the probability of at least r + 1 successes in n + 1 Bernoullian
trials with probability of success $(r/n) - \epsilon$ on each trial. Hint: See
Prob. 25, Chap. 4, and Prob. 11, Chap. 7.
17. Using Prob. 16 and applying the weak law of large numbers to 
the auxiliary Bernoullian sequence, give another proof of Theorem IX. 
18. Show that, if x is chosen at random between $-\pi$ and $\pi$, then,
given $\epsilon > 0$,

$$\mathrm{pr}\left\{\left|\frac{1}{n}\sum_{k=1}^{n} \sin kx\right| < \epsilon\right\} \to 1$$

as $n \to \infty$. Hint: Note Prob. 33, Chap. 6, and apply (b) of Corollary
1 to Theorem I.
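
For a single fixed x the average in Prob. 18 can be computed directly. The sketch below is a numerical illustration only and does not follow the route of the hint. It uses the elementary identity $\sum_{k=1}^{n} \sin kx = \sin(nx/2)\sin((n+1)x/2)/\sin(x/2)$, which bounds the average by $1/(n|\sin(x/2)|)$ whenever $\sin(x/2) \ne 0$. The function names are hypothetical.

```python
import math


def mean_sine(x, n):
    """Return (sin x + sin 2x + ... + sin nx) / n."""
    return sum(math.sin(k * x) for k in range(1, n + 1)) / n


def mean_sine_bound(x, n):
    """Bound 1 / (n |sin(x/2)|) on |mean_sine(x, n)|, valid when sin(x/2) != 0."""
    return 1.0 / (n * abs(math.sin(x / 2.0)))


for n in (10, 100, 1000):
    print(n, mean_sine(1.0, n), mean_sine_bound(1.0, n))
```

The averages shrink at least as fast as 1/n for this x, in agreement with the limit asserted in the problem.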
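19. Let the functions $x_n(t)$ be defined on the unit interval 0 < t < 1
as follows:

$$x_n(t) = \begin{cases} 1 & \text{for } \dfrac{2i - 2}{2^n} < t < \dfrac{2i - 1}{2^n}, \\ -1 & \text{for } \dfrac{2i - 1}{2^n} < t < \dfrac{2i}{2^n} \end{cases} \qquad (i = 1, 2, \ldots, 2^{n-1}).$$

Draw the first three $x_n(t)$’s.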
20. Show that, if t is chosen at random between 0 and 1, the
stochastic variables $x_n(t)$ of Prob. 19 are independent.
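
The independence asserted in Prob. 20 can be checked exactly for the first three functions. The sketch assumes that $x_n(t)$ is +1 on the first, third, ... subintervals of length $2^{-n}$ and -1 on the others, so that its value depends only on whether the integer part of $2^n t$ is even. The values given at the division points are our own convention and do not affect any probability.

```python
import math
from collections import Counter
from itertools import product


def x(n, t):
    """Assumed x_n(t): +1 when the integer part of 2^n t is even, otherwise -1."""
    return 1 if math.floor(t * 2 ** n) % 2 == 0 else -1


# x_1, x_2, x_3 are constant on each of the 8 intervals (j/8, (j+1)/8),
# so the midpoints of these intervals give the joint distribution exactly.
midpoints = [(j + 0.5) / 8 for j in range(8)]
patterns = Counter((x(1, t), x(2, t), x(3, t)) for t in midpoints)
for signs in product((1, -1), repeat=3):
    print(signs, patterns[signs] / 8)
```

Each of the eight sign patterns has probability 1/8, which is the product of the three separate probabilities 1/2.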


21. Show that, if t is chosen at random between 0 and 1,

$$\mathrm{pr}\left\{\lim_{n \to \infty} \frac{1}{n}\sum_{k=1}^{n} x_k(t) = 0\right\} = 1.$$


22. Let t be chosen at random between 0 and 1, and let $m_n(t)$ be
the arithmetic mean of the first n digits in the decimal expansion of t.
Show that pr{$m_n(t) \to 4.5$} = 1.
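
The behavior described in Prob. 22 is easy to simulate. The sketch assumes that choosing t at random between 0 and 1 amounts to choosing its decimal digits independently, each of 0, 1, ..., 9 with probability 1/10. The seed and the function name are our own.

```python
import random


def digit_mean(n, rng):
    """Arithmetic mean of n independent random decimal digits."""
    return sum(rng.randrange(10) for _ in range(n)) / n


rng = random.Random(22)  # fixed seed (arbitrary) so that runs are repeatable
means = {n: digit_mean(n, rng) for n in (10, 1000, 100000)}
for n, mean in means.items():
    print(n, mean)
```

The means for large n should lie close to 4.5, the expectation of a single digit.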

23. Generalize the example on decimal expansions at the end of the
chapter. Consider dyadic expansions (see Chap. 8) and, in general,
expansions in powers of 1/N where N is any positive integer. Show
that each of these expansions has the normality property with
probability 1.


24. A normal number is one which has the normality property for
every expansion of the type suggested in Prob. 23. Show that with
probability 1 a number chosen at random is normal. Hint: Show
that the probability of failure here is less than or equal to the sum of
an infinite sequence of zeros.


INDEX 


A 


Addition formula, general, 69-70 
Addition principle, 14, 16, 18 
Axioms for probability, 18 


B 


Bacteria counts, 187 

Bayes’ theorem, 43-45, 50-51, 202-204 
Bernoulli, Jakob, 61

Bernoulli’s formula, 63-67

Bernoullian sequence of trials, 61-63, 112-113, 140
expectation of number of successes 
in, 112 
limit laws for, 148, 158-159, 168-174, 
202-204 


probability of r successes in, 63 
normal approximation to, 172-174, 
183-186 
Poisson approximation to, 179-186 
variance of number of successes in, 
112 
Beta function, 122-124 
Binomial coefficient, 5-6 
Binomial theorem, 6, 8 
Bombardment of screen, 27-28, 83 
Buffon needle problem, 53-54 


C


Cauchy’s distribution, 33, 95, 99, 167- 
168 
Central limit law, 137-139 
compared with weak law of large 
numbers, 141-142 
Central limit theorem, 151-174 
for Bernoullian sequences, 168-174 
Cauchy’s counterexample, 167-168 
for identical variables, 153-154, 164- 
165 


Liapounoff’s condition, 166-167
Lindeberg’s condition, 153-154 
for normally distributed variables, 
152 
Chance variable (see Stochastic vari- 
ables) 
Characteristic function, 161-164 
inversion formulas for, 161-163 
of normal distribution, 164 
of a sum, 163 
Chi-square (χ²) distribution, 134, 160
Chi-square (χ²) test, 159-161, 175
Combination, 2-5 
vacuous, 3 
Combination numbers, 2 
Complement, 17 
Convergence in probability, 196-197, 
199-200 
Convolution, 88-91 
Correlation, 113-118 
coefficient of, 114-115 
normal, 115-118 
Covariance, 113-115 


D 


Density function, 24-27 
conditional, 40-41 
joint, 35-37, 45-46, 86 
in independent case, 49-50 
marginal, 37-38 
transformations on, 76-81 
Dice, 14-16, 38-39, 82 
Discrete space, 17 
Dispersion (see Variance) 
Distribution function, 16, 18, 22 
joint, 35 
marginal, 37 
Duhamel’s theorem, 26 
Dyadic expansion, 144 


E 


Equally likely results, 11-12 
Event space, 16 
Events, 16-18 

component, 34 

compound, 34, 86 

independent, 47-48 
Expectation, 98, 101-105 

of product, 105, 113-114 

of sum, 104-105 
Exponential, complex, 130-132 
Extrasensory perception, 107-108 


F 


Feller, W., 139, 153, 171 
Frequency, 10-11 
(See also Sample frequency) 


G 


Gamma function, 122-124 

Gaussian distribution (see Normal dis- 
tribution) 

Geiger counter, 189-190 

Genetics, 54-56 


I 


Infinitesimal, 26, 27 

Integral, 23-27, 29-30, 36 
absolutely convergent, 36, 97-98 
improper, 36 
Lebesgue, 24 
multiple, 36 
Riemann, 24, 26 
Stieltjes, 29-30, 97 

Iterated logarithm, law of, 146-147 


K 
Kolmogoroff, 147, 205 

L 
Laplace, 138, 166 


Large numbers, laws of, 140-146
strong, 143, 204-206
weak, 140
compared with central limit law, 141, 142
conditions for, 194-195
for identical variables, 195, 209
Legendre, formula of, 124, 133 
Lhopital’s rule, 135 
Liapounoff, 138, 166 
Lindeberg, 139, 153 
Logarithmiconormal distribution, 120, 
176 


M 


Maclaurin’s formula with remainder, 
135 
Matching problem, 70-72, 107-108 
Mean, 98 
(See also Sample mean) 
Measurable set, 17 
Median, 100 
Moivre, de, 126, 138 
Molina’s tables, 185-186 
Moments, 97-98 
Multiplication theorem, 42-43 
Mutually exclusive events, 17-18 


N 


Normal decimal expansion, 206-208 
Normal distribution, 91-94 
in central limit law, 138 
circular, 93-94 
definition of, 91 
moments of, 98-99
in nature, 155-156
of sample mean, 156-159
Normal number, 210
Normally distributed variables, 91-94
linear combinations of, 92-93, 116-117


O


O, o notation, 124-125


P 


Pascal’s triangle, 3 
Permutation, 1 




Poisson distribution, 32, 95, 119, 139- 
140, 148, 170, 179-190 
as approximation to Bernoullian 
probabilities, 179-186, 190 
as exact distribution, 187-190 
examples of, 186-187, 189, 191-192 
Poisson sequence of trials, 119, 148, 175, 
177 
Poker hands, 4-5 
Polya, G., 139 
Probability, 10-12, 18-19 
axiomatic definition of, 18 
conditional, 39-42, 43 
in continuous case, 23-27 
in discrete case, 21-23 
elementary definition of, 11 
in mixed case, 29-30 
zero, 19, 25, 143 
Probability function, 21-23 
conditional, 41-42 
joint, 41-42, 45-46 
marginal, 41-42 
transformations on, 77, 80-81 
Product of events, 17 


R 


Radium atom, 28-29, 79-80, 83-84, 
100-101 

Random point on line, 25-26, 51-53 

Random variable (see Stochastic vari- 
able) 

Random walk problem, 67-69 

Results, equally likely, 11-12 

Roulette, 12-14 


S 
St. Petersburg problem, 119 


Sample, 62 
Sample distribution, 198-200 


Sample frequency, 198-200 
Sample mean, 156-159, 198-200 
Sample variance, 198-200 
Schwartz’s inequality, 115 
Series, 22 
absolutely convergent, 22, 97-98 
Standard deviation, 98 
Stirling’s theorem, 126-130, 171 
Stochastic variables, 19-21, 76-80 
definition of, 20 
equal, 87 
functions of, 76-91 
density functions for, 76-81 
identical, 87 
central limit theorem for, 153-154, 
164-165 
law of large numbers for, 195, 209 
independent, 46-48 
products of, 84-88 
sums of, 84-91 
in independent case, 88-91 
translation and change of scale in, 
80-81 
uncorrelated, 113-114 
by pairs, 114 
Sum of events, 17 


T 


Telephone problem, 186-187 
Tshebysheff’s inequality, 111-112 


U 


Urn, drawing balls from, 23, 50-51, 82, 
105-107 
Uspensky, J. V., 2, 54 


V


Variance, 98, 108-112 
of sum, 110-111, 114 




